Building a "Search Inside PDF" Feature for Your Web App
Ever uploaded a 200-page PDF and needed to find where it mentions "Total" or "Error"? Scrolling through pages manually is painful. Building full-text search from scratch means parsing PDFs, indexing content, and handling edge cases.
There is a much simpler approach: use an API that does the heavy lifting. In this tutorial, we will build a Python function that searches inside any PDF and returns the exact page numbers and context lines where your search term appears.
Quick Start: Search a PDF in a Few Lines
import requests

response = requests.post(
    'https://apdf.io/api/pdf/content/search',
    headers={
        'Authorization': 'Bearer YOUR_API_TOKEN',
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    json={
        'file': 'https://example.com/document.pdf',
        'text': 'invoice'
    }
)
print(response.json())
Real-World Use Case: Support Ticket Search
Imagine you are building a support dashboard. Customers upload PDF documents (contracts, invoices, error logs), and your team needs to quickly find specific information without opening each file.
Let us build a reusable function that:
- Takes a PDF URL and search term
- Calls the aPDF.io Search API
- Returns a clean summary of matches
Step 1: Get Your API Token
- Go to aPDF.io and sign up (free).
- Copy your API Token from the dashboard.
Install the requests library if you have not already:
pip install requests
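Rather than pasting the token into source code, one option is to read it from an environment variable. The sketch below is a minimal example of that idea; the variable name APDF_API_TOKEN and the helper load_api_token are assumptions made for illustration, not something the API requires.

```python
import os


def load_api_token(env=os.environ, var_name="APDF_API_TOKEN"):
    """Return the API token stored in an environment variable.

    APDF_API_TOKEN is a name chosen for this sketch (an assumption),
    not a name defined by aPDF.io.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token
```

With this in place, API_TOKEN = load_api_token() could stand in for the hardcoded placeholder, which keeps the token out of version control.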
Step 2: The Search Function
import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/content/search"


def search_pdf(pdf_url, search_term, case_sensitive=False, use_regex=False):
    """
    Search for a term inside a PDF and return matches with context.

    Args:
        pdf_url: URL of the PDF file to search
        search_term: Text to search for
        case_sensitive: Set True for case-sensitive matching
        use_regex: Set True to interpret search_term as a regex pattern

    Returns:
        dict with search results or error message
    """
    payload = {
        'file': pdf_url,
        'text': search_term
    }
    # Optional flags are only sent when enabled
    if case_sensitive:
        payload['case'] = '1'
    if use_regex:
        payload['regex'] = '1'
    try:
        response = requests.post(
            API_URL,
            headers={
                'Authorization': f'Bearer {API_TOKEN}',
                'Accept': 'application/json',
                'Content-Type': 'application/json'
            },
            json=payload,
            timeout=30  # avoid hanging forever on a stalled connection
        )
        if response.status_code == 200:
            return response.json()
        return {'error': f'API returned {response.status_code}: {response.text}'}
    except (requests.RequestException, ValueError) as e:
        # Network failures, timeouts, or a response body that is not valid JSON
        return {'error': str(e)}
def print_results(results):
    """Pretty-print search results."""
    if 'error' in results:
        print(f"Error: {results['error']}")
        return
    print(f"\nSearch term: '{results['search_text']}'")
    print(f"Total matches: {results['results_total']}")
    print(f"Pages with matches: {results['results_pages']}\n")
    if results['results']:
        for match in results['results']:
            print(f" Page {match['page']}: ...{match['matched_line']}...")
            print(f" Matched: '{match['exact_word']}'\n")
    else:
        print(" No matches found.")
# Example usage
if __name__ == "__main__":
    pdf_url = "https://pdfobject.com/pdf/sample.pdf"

    # Basic search
    print("=== Basic Search ===")
    results = search_pdf(pdf_url, "PDF")
    print_results(results)

    # Regex search: find words that start with a capital letter
    print("\n=== Regex Search ===")
    results = search_pdf(pdf_url, "[A-Z][a-z]+", use_regex=True)
    print_results(results)
Run the Script
Save the code from Step 2 as pdf_search.py, then run:
python pdf_search.py
Output
=== Basic Search ===
Search term: 'PDF'
Total matches: 2
Pages with matches: 1
Page 1: ...Sample PDF...
Matched: 'PDF'
Page 1: ...This is a simple PDF file. Fun fun fun....
Matched: 'PDF'
=== Regex Search ===
Search term: '[A-Z][a-z]+'
Total matches: 36
Pages with matches: 1
Page 1: ...Sample PDF...
Matched: 'Sample'
Page 1: ...This is a simple PDF file. Fun fun fun....
Matched: 'This'
... (more matches)
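The regex mode also makes it possible to look for several literal terms in one request, such as the "Total" and "Error" example from the introduction. The sketch below builds an alternation pattern from a list of terms. The helper name build_any_term_pattern is hypothetical, and the escaping follows Python's re module; the regex dialect the API uses is not stated here, so treat compatibility as an assumption to verify.

```python
import re


def build_any_term_pattern(terms):
    """Join literal search terms into one alternation pattern.

    Each term is escaped with Python's re.escape so characters such as
    '.' or '+' are matched literally. Whether the API's regex engine
    accepts the same escapes is an assumption.
    """
    cleaned = [t.strip() for t in terms if t and t.strip()]
    if not cleaned:
        raise ValueError("At least one non-empty term is required")
    return "|".join(re.escape(t) for t in cleaned)


pattern = build_any_term_pattern(["Total", "Error", "v1.2"])
re.compile(pattern)  # confirms the pattern is valid for Python's engine
print(pattern)
```

The resulting string could then be passed as search_pdf(pdf_url, pattern, use_regex=True).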
Step 3: Add a Simple Web Interface (Optional)
Want to let users search PDFs from a web form? Here is a minimal Flask app:
from flask import Flask, request, render_template_string
import requests

app = Flask(__name__)
API_TOKEN = "YOUR_API_TOKEN"

TEMPLATE = """
<!DOCTYPE html>
<html>
<head><title>PDF Search</title></head>
<body style="font-family: sans-serif; max-width: 600px; margin: 50px auto;">
  <h1>Search Inside PDF</h1>
  <form method="POST">
    <input name="pdf_url" placeholder="PDF URL" style="width: 100%; padding: 10px; margin: 5px 0;" required>
    <input name="search_term" placeholder="Search term" style="width: 100%; padding: 10px; margin: 5px 0;" required>
    <button type="submit" style="padding: 10px 20px;">Search</button>
  </form>
  {% if error %}
  <p style="color: #b00020;">{{ error }}</p>
  {% endif %}
  {% if results %}
  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5;">
    <strong>Found {{ results.results_total }} matches on {{ results.results_pages }} page(s)</strong>
    <ul>
      {% for match in results.results %}
      <li>Page {{ match.page }}: "{{ match.matched_line }}"</li>
      {% endfor %}
    </ul>
  </div>
  {% endif %}
</body>
</html>
"""


@app.route('/', methods=['GET', 'POST'])
def search():
    results = None
    error = None
    if request.method == 'POST':
        try:
            response = requests.post(
                'https://apdf.io/api/pdf/content/search',
                headers={
                    'Authorization': f'Bearer {API_TOKEN}',
                    'Accept': 'application/json',
                    'Content-Type': 'application/json'
                },
                json={
                    'file': request.form['pdf_url'],
                    'text': request.form['search_term']
                },
                timeout=30
            )
            # Only pass the body to the template when the API call succeeded
            if response.status_code == 200:
                results = response.json()
            else:
                error = f'API returned {response.status_code}'
        except (requests.RequestException, ValueError) as e:
            error = f'Search failed: {e}'
    return render_template_string(TEMPLATE, results=results, error=error)


if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True)
Run it with python app.py and open http://localhost:5000.
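Because the form forwards whatever the user types to the API, it may be worth rejecting values that are clearly not web URLs before making the request. The following is a minimal sketch of such a check using only the standard library; the helper name is_http_url is hypothetical, and the check is deliberately shallow (it does not confirm that the URL points to a PDF).

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Return True if value looks like an absolute http(s) URL."""
    try:
        parsed = urlparse(value.strip())
    except (AttributeError, ValueError):
        # Not a string, or a string urlparse cannot handle
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_http_url("https://example.com/document.pdf"))
print(is_http_url("file:///etc/passwd"))
```

In the Flask view, a failed check on request.form['pdf_url'] could short-circuit with an error message instead of calling the API.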
API Response Explained
The Search API returns a JSON object with these fields:
{
  "search_text": "invoice",
  "results_total": 5,
  "results_pages": 2,
  "results": [
    {
      "page": "1",
      "matched_line": "Invoice #12345 dated November 2025",
      "exact_word": "Invoice"
    },
    {
      "page": "3",
      "matched_line": "Please pay this invoice within 30 days",
      "exact_word": "invoice"
    }
  ]
}
- search_text: The term you searched for
- results_total: Total number of matches found
- results_pages: Number of pages containing at least one match
- results: Array of match objects, each with page (the page number, a string in the sample above), matched_line (the full line of text), and exact_word (the exact matched text)
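Since each match carries its own page value, a response can be regrouped so that every page lists its matched lines once. The sketch below assumes the response shape shown in the sample above, including page arriving as a string; the helper name group_matches_by_page is hypothetical.

```python
from collections import defaultdict


def group_matches_by_page(response):
    """Map page number (int) to the list of matched lines, in page order."""
    grouped = defaultdict(list)
    for match in response.get("results", []):
        # The sample response gives page as a string, so convert for sorting
        grouped[int(match["page"])].append(match["matched_line"])
    return dict(sorted(grouped.items()))


sample = {
    "search_text": "invoice",
    "results": [
        {"page": "3", "matched_line": "Please pay this invoice within 30 days",
         "exact_word": "invoice"},
        {"page": "1", "matched_line": "Invoice #12345 dated November 2025",
         "exact_word": "Invoice"},
    ],
}
for page, lines in group_matches_by_page(sample).items():
    print(page, lines)
```

Converting page to an integer keeps page 10 from sorting ahead of page 2, which would happen with string keys.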
Next Steps
- Extract Full Text: Use the Read Content endpoint to extract all text from specific pages for further processing.
- Jump to Specific Pages: Once you know which pages contain matches, use the Extract Pages endpoint to create a new PDF with only the relevant pages.
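For the support dashboard scenario described earlier, another natural step is running one term against several uploaded PDFs. The sketch below loops over a list of URLs and separates documents with matches from documents that failed. The helper search_documents is hypothetical; it takes the search function as a parameter, and it assumes that function returns either the response fields shown above or a dict with an 'error' key, as search_pdf from Step 2 does. The fake_search stub exists only so the example runs without network access.

```python
def search_documents(pdf_urls, term, search_fn):
    """Run search_fn(url, term) for each URL and return (hits, failures).

    hits maps URL -> total match count for documents with at least one match;
    failures maps URL -> error message.
    """
    hits, failures = {}, {}
    for url in pdf_urls:
        result = search_fn(url, term)
        if "error" in result:
            failures[url] = result["error"]
        elif result.get("results_total", 0) > 0:
            hits[url] = result["results_total"]
    return hits, failures


def fake_search(url, term):
    """Stand-in for search_pdf so this sketch runs offline (illustrative data)."""
    canned = {
        "https://example.com/a.pdf": {"results_total": 2, "results": []},
        "https://example.com/b.pdf": {"results_total": 0, "results": []},
    }
    return canned.get(url, {"error": "not found"})


urls = ["https://example.com/a.pdf", "https://example.com/b.pdf",
        "https://example.com/c.pdf"]
hits, failures = search_documents(urls, "invoice", fake_search)
print(hits)
print(failures)
```

Passing search_pdf in place of fake_search would issue one API request per document, so the number of calls grows with the number of files.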