Building a "Search Inside PDF" Feature for Your Web App

Ever uploaded a 200-page PDF and needed to find where it mentions "Total" or "Error"? Scrolling through pages manually is painful. Building full-text search from scratch means parsing PDFs, indexing content, and handling edge cases.

There is a much simpler approach: use an API that does the heavy lifting. In this tutorial, we will build a Python function that searches inside any PDF and returns the exact page numbers and context lines where your search term appears.

Quick Start: Search a PDF in a Few Lines

Here is the simplest possible example using the free aPDF.io API:

import requests

response = requests.post(
    'https://apdf.io/api/pdf/content/search',
    headers={
        'Authorization': 'Bearer YOUR_API_TOKEN',
        'Accept': 'application/json',
        'Content-Type': 'application/json'
    },
    json={
        'file': 'https://example.com/document.pdf',
        'text': 'invoice'
    }
)

print(response.json())

That is it. Once YOUR_API_TOKEN is replaced with a real token (see Step 1 below), the API returns every occurrence of "invoice" with page numbers and surrounding context.

Real-World Use Case: Support Ticket Search

Imagine you are building a support dashboard. Customers upload PDF documents (contracts, invoices, error logs), and your team needs to quickly find specific information without opening each file.

Let us build a reusable function that:

  1. Takes a PDF URL and search term
  2. Calls the aPDF.io Search API
  3. Returns a clean summary of matches

Step 1: Get Your API Token

  1. Go to aPDF.io and sign up (free).
  2. Copy your API Token from the dashboard.
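
Rather than pasting the token into source code, you may prefer to read it from an environment variable. The sketch below assumes a variable named APDF_API_TOKEN, a name chosen for this example and not something the API requires:

```python
import os


def load_api_token(env_var="APDF_API_TOKEN"):
    """Return the API token from the environment, or raise a clear error.

    APDF_API_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token
```

In the scripts that follow, you could then write API_TOKEN = load_api_token() in place of the hard-coded string.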

Install the requests library if you have not already:

pip install requests

Step 2: The Search Function

Create a file named pdf_search.py:

import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/content/search"

def search_pdf(pdf_url, search_term, case_sensitive=False, use_regex=False):
    """
    Search for a term inside a PDF and return matches with context.

    Args:
        pdf_url: URL of the PDF file to search
        search_term: Text to search for
        case_sensitive: Set True for case-sensitive matching
        use_regex: Set True to interpret search_term as a regex pattern

    Returns:
        dict with search results or error message
    """
    payload = {
        'file': pdf_url,
        'text': search_term
    }

    if case_sensitive:
        payload['case'] = '1'
    if use_regex:
        payload['regex'] = '1'

    try:
        response = requests.post(
            API_URL,
            headers={
                'Authorization': f'Bearer {API_TOKEN}',
                'Accept': 'application/json',
                'Content-Type': 'application/json'
            },
            json=payload,
            timeout=30  # fail instead of hanging if the server does not respond
        )

        if response.status_code == 200:
            return response.json()
        else:
            return {'error': f'API returned {response.status_code}: {response.text}'}

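    except requests.RequestException as e:
        # Network failures, timeouts, and malformed request URLs
        return {'error': str(e)}
    except ValueError as e:
        # The response body was not valid JSON
        return {'error': f'Invalid JSON in response: {e}'}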


def print_results(results):
    """Pretty-print search results."""
    if 'error' in results:
        print(f"Error: {results['error']}")
        return

    print(f"\nSearch term: '{results['search_text']}'")
    print(f"Total matches: {results['results_total']}")
    print(f"Pages with matches: {results['results_pages']}\n")

    if results['results']:
        for match in results['results']:
            print(f"  Page {match['page']}: ...{match['matched_line']}...")
            print(f"    Matched: '{match['exact_word']}'\n")
    else:
        print("  No matches found.")


# Example usage
if __name__ == "__main__":
    pdf_url = "https://pdfobject.com/pdf/sample.pdf"

    # Basic search
    print("=== Basic Search ===")
    results = search_pdf(pdf_url, "PDF")
    print_results(results)

    # Regex search: find capitalized words (one uppercase letter followed by lowercase letters)
    print("\n=== Regex Search ===")
    results = search_pdf(pdf_url, "[A-Z][a-z]+", use_regex=True)
    print_results(results)

Run the Script

python pdf_search.py

Output

=== Basic Search ===

Search term: 'PDF'
Total matches: 2
Pages with matches: 1

  Page 1: ...Sample PDF...
    Matched: 'PDF'

  Page 1: ...This is a simple PDF file. Fun fun fun....
    Matched: 'PDF'

=== Regex Search ===

Search term: '[A-Z][a-z]+'
Total matches: 36
Pages with matches: 1

  Page 1: ...Sample PDF...
    Matched: 'Sample'

  Page 1: ...This is a simple PDF file. Fun fun fun....
    Matched: 'This'

  ... (more matches)
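
When a search returns dozens of matches, a per-page count can be easier to scan than the full list. The following is a minimal sketch, assuming the response has a 'results' list whose items carry a 'page' value, as print_results does. The helper name matches_per_page and the sample data are made up for illustration:

```python
from collections import Counter


def matches_per_page(results):
    """Count matches on each page of a search response.

    Assumes a 'results' list whose items have a 'page' value,
    which may arrive as a numeric string.
    """
    counts = Counter(int(match['page']) for match in results.get('results', []))
    # Sort by page number so the summary reads top to bottom
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample data in the same shape as the API response
    sample = {
        'results': [
            {'page': '3', 'matched_line': 'Please pay this invoice', 'exact_word': 'invoice'},
            {'page': '1', 'matched_line': 'Invoice #12345', 'exact_word': 'Invoice'},
            {'page': '3', 'matched_line': 'invoice total', 'exact_word': 'invoice'},
        ]
    }
    print(matches_per_page(sample))  # {1: 1, 3: 2}
```

An error dict from search_pdf has no 'results' key, so the helper returns an empty dict in that case.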

Step 3: Add a Simple Web Interface (Optional)

Want to let users search PDFs from a web form? Here is a minimal Flask app. Install Flask with pip install flask and save the code as app.py:

from flask import Flask, request, jsonify, render_template_string
import requests

app = Flask(__name__)
API_TOKEN = "YOUR_API_TOKEN"

TEMPLATE = """
<!DOCTYPE html>
<html>
<head><title>PDF Search</title></head>
<body style="font-family: sans-serif; max-width: 600px; margin: 50px auto;">
    <h1>Search Inside PDF</h1>
    <form method="POST">
        <input name="pdf_url" placeholder="PDF URL" style="width: 100%; padding: 10px; margin: 5px 0;" required>
        <input name="search_term" placeholder="Search term" style="width: 100%; padding: 10px; margin: 5px 0;" required>
        <button type="submit" style="padding: 10px 20px;">Search</button>
    </form>
    {% if results %}
    <div style="margin-top: 20px; padding: 15px; background: #f5f5f5;">
        <strong>Found {{ results.results_total }} matches on {{ results.results_pages }} page(s)</strong>
        <ul>
        {% for match in results.results %}
            <li>Page {{ match.page }}: "{{ match.matched_line }}"</li>
        {% endfor %}
        </ul>
    </div>
    {% endif %}
</body>
</html>
"""

@app.route('/', methods=['GET', 'POST'])
def search():
    results = None
    if request.method == 'POST':
        response = requests.post(
            'https://apdf.io/api/pdf/content/search',
            headers={
                'Authorization': f'Bearer {API_TOKEN}',
                'Accept': 'application/json',
                'Content-Type': 'application/json'
            },
            json={
                'file': request.form['pdf_url'],
                'text': request.form['search_term']
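            },
            timeout=30  # do not let a slow upstream request hang the page
        )
        # Render results only when the API call succeeded
        if response.ok:
            results = response.json()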
    return render_template_string(TEMPLATE, results=results)

if __name__ == '__main__':
    app.run(debug=True)

Run it with python app.py and open http://localhost:5000. Flask's debug mode is meant for local development only, so turn it off before deploying.
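
Because the form forwards whatever the user types, it may be worth rejecting obviously malformed input before calling the API. Here is a minimal sketch using a hypothetical helper, is_valid_pdf_url, which is not part of Flask or the aPDF.io API. It only checks the URL's shape, not whether the file exists or is really a PDF:

```python
from urllib.parse import urlparse


def is_valid_pdf_url(url):
    """Return True if url is an absolute http(s) URL with a host."""
    try:
        parsed = urlparse(url.strip())
    except (AttributeError, ValueError):
        # None or other non-string input, or a URL that cannot be parsed
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Inside search(), you could call this on request.form['pdf_url'] and skip the API request when it returns False.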

API Response Explained

The Search API returns a JSON object with these fields:

{
    "search_text": "invoice",
    "results_total": 5,
    "results_pages": 2,
    "results": [
        {
            "page": "1",
            "matched_line": "Invoice #12345 dated November 2025",
            "exact_word": "Invoice"
        },
        {
            "page": "3",
            "matched_line": "Please pay this invoice within 30 days",
            "exact_word": "invoice"
        }
    ]
}

  • search_text: The term you searched for
  • results_total: Total number of matches found
  • results_pages: Number of pages containing at least one match
  • results: Array of match objects, each with the page number (a string in the sample above), the full line of text, and the exact matched word
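
If you pass results around a larger application, typed objects can be more convenient than raw dictionaries. The sketch below assumes the field names from the sample response above. The class and function names (Match, SearchResult, parse_search_response) are hypothetical and chosen only for this example:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    page: int
    matched_line: str
    exact_word: str


@dataclass
class SearchResult:
    search_text: str
    results_total: int
    results_pages: int
    matches: List[Match]


def parse_search_response(data):
    """Convert the raw response dict into typed objects.

    Field names follow the sample response; 'page' is converted to int
    because the sample shows it as a string.
    """
    matches = [
        Match(int(m['page']), m['matched_line'], m['exact_word'])
        for m in data.get('results', [])
    ]
    return SearchResult(
        search_text=data['search_text'],
        results_total=int(data['results_total']),
        results_pages=int(data['results_pages']),
        matches=matches,
    )
```

With integer page numbers, sorting and comparing matches by page works as expected, which is not the case for strings ("10" sorts before "2").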

Next Steps

Now that you can search inside PDFs, here are some useful follow-up tasks:
  • Extract Full Text: Use the Read Content endpoint to extract all text from specific pages for further processing.
  • Jump to Specific Pages: Once you know which pages contain matches, use the Extract Pages endpoint to create a new PDF with only the relevant pages.