Convert Scanned PDFs to Searchable Documents with Python

Scanned PDFs are everywhere: old contracts, signed forms, receipts, or documents from legacy systems. The problem? They're just images trapped inside a PDF wrapper. You can't search them, copy text, or feed them into your data pipelines.

Running OCR (Optical Character Recognition) locally is painful. You need to install multiple tools, manage language packs, and write custom image preprocessing code. It can work, but the setup is tedious and the results are often unreliable.

A much easier approach: send the scanned PDF to an API and get back a searchable PDF with an invisible text layer. The original look is preserved, but now you can select, copy, and search the text.

Quick Example

Here's how simple it is with Python and the aPDF.io OCR API:

import time
import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"
STATUS_URL = "https://apdf.io/api/job/status/check"

# Helper: poll until the async job finishes, then return its result.
def wait_for_job(job_id, max_attempts=1200):
    for _ in range(max_attempts):
        check = requests.post(
            STATUS_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            data={"id": job_id},
        )
        check.raise_for_status()
        body = check.json()
        if body["status"] == "successful":
            return body["result"]
        if body["status"] == "failed":
            raise RuntimeError(body.get("error") or "Job failed")
        time.sleep(2)
    raise TimeoutError("Job did not finish in time")

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/scanned-document.pdf"
    }
)
response.raise_for_status()

job_id = response.json()["job_id"]
result = wait_for_job(job_id)
print(f"Searchable PDF: {result['file']}")

That's it. The API returns a URL to the new PDF with embedded text.
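To keep a local copy, you can fetch that URL like any other file. Here's a small sketch (the `local_name` helper and its fallback name are my own convention, not part of the API) that saves the searchable PDF next to your script:

```python
import requests
from pathlib import Path
from urllib.parse import urlparse

def local_name(result_url, fallback="searchable.pdf"):
    # Reuse the filename from the URL path, or fall back to a default.
    name = Path(urlparse(result_url).path).name
    return name or fallback

def download_pdf(result_url, out_dir="."):
    # Stream the finished PDF to disk and return its local path.
    resp = requests.get(result_url, timeout=60)
    resp.raise_for_status()
    out_path = Path(out_dir) / local_name(result_url)
    out_path.write_bytes(resp.content)
    return out_path

# Usage, continuing from the example above:
# download_pdf(result["file"])
```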

Real-World Scenario: Digitizing a Paper Archive

Imagine you're building a document management system for a law firm. They have thousands of scanned case files from the 2000s. Lawyers need to search for specific terms like client names or case numbers.

Here's a Python script that processes a batch of scanned PDFs and converts them to searchable documents:

import requests
import time

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"
STATUS_URL = "https://apdf.io/api/job/status/check"

# List of scanned PDFs to process
scanned_files = [
    "https://your-storage.com/case-2001-smith.pdf",
    "https://your-storage.com/case-2002-jones.pdf",
    "https://your-storage.com/case-2003-wilson.pdf"
]

# Helper: poll until the async job finishes, then return its result.
def wait_for_job(job_id, max_attempts=1200):
    for _ in range(max_attempts):
        check = requests.post(
            STATUS_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            data={"id": job_id},
        )
        check.raise_for_status()
        body = check.json()
        if body["status"] == "successful":
            return body["result"]
        if body["status"] == "failed":
            raise RuntimeError(body.get("error") or "Job failed")
        time.sleep(2)
    raise TimeoutError("Job did not finish in time")

def convert_to_searchable(file_url):
    """Convert a scanned PDF to searchable PDF using OCR"""
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
            "Content-Type": "application/json"
        },
        json={"file": file_url}
    )

    if response.status_code != 200:
        print(f"Error processing {file_url}: {response.text}")
        return None

    try:
        return wait_for_job(response.json()["job_id"])
    except (RuntimeError, TimeoutError) as e:
        print(f"Job failed for {file_url}: {e}")
        return None

# Process all files
for file_url in scanned_files:
    print(f"Processing: {file_url}")
    result = convert_to_searchable(file_url)

    if result:
        print(f"  -> Searchable PDF: {result['file']}")
        print(f"  -> Pages: {result['pages']}, Size: {result['size']} bytes")

print("\nDone! All PDFs are now searchable.")
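For larger archives, the sequential loop above becomes the bottleneck: each file waits for the previous job to finish. One way to speed this up is a small concurrent runner. This is a generic sketch using the standard library's thread pool; the worker can be any function of one URL, such as the convert_to_searchable function defined above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_batch(file_urls, worker, max_workers=4):
    """Run worker(url) for each URL concurrently; return {url: result}.

    Failed jobs are logged and recorded as None rather than aborting
    the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in file_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                print(f"Failed {url}: {exc}")
                results[url] = None
    return results

# Usage with the batch script above:
# results = process_batch(scanned_files, convert_to_searchable)
```

Keep max_workers modest: each worker holds an open polling loop, and your API plan may cap concurrent jobs.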

What Happens Behind the Scenes

When you call the OCR convert endpoint:

  1. The API downloads your scanned PDF
  2. Each page is analyzed using OCR to extract text
  3. An invisible text layer is added on top of the original image
  4. You get back a new PDF that looks identical but is fully searchable

The original layout, fonts, and images are preserved. The only difference is that you can now select text with your mouse and use Ctrl+F to search.
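If you want a quick sanity check that a text layer was actually added, one rough heuristic (my own, not part of the API) is to look for font resources in the raw PDF bytes; image-only scans usually declare none. Note that PDFs which store their object dictionaries in compressed streams can defeat this check:

```python
def has_text_layer(pdf_path):
    """Rough check: searchable PDFs declare /Font resources for their
    text layer, while image-only scans usually don't. PDFs that keep
    objects in compressed streams may produce a false negative."""
    with open(pdf_path, "rb") as f:
        return b"/Font" in f.read()
```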

Skip Polling with Webhooks

The OCR endpoint always runs in the background. The examples above poll until the job is done. If you'd rather not poll — for example, in a serverless function or a long-running batch — pass a webhook_url in the original request and aPDF.io will POST the same result payload to that URL once the job completes:

requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/large-scanned-document.pdf",
        "webhook_url": "https://your-app.com/webhooks/ocr-complete"
    }
)
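On the receiving side, your endpoint just needs to accept a POST and read the job result from the body. A minimal standard-library sketch, assuming the webhook body is JSON with the same shape as the polling response (check the aPDF.io docs for the exact payload your account receives):

```python
import json
from http.server import BaseHTTPRequestHandler

def extract_file_url(raw_body: bytes):
    # Assumed payload shape: mirrors the polling result, e.g.
    # {"status": "successful", "result": {"file": "https://..."}}.
    payload = json.loads(raw_body)
    return (payload.get("result") or {}).get("file")

class OcrWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        file_url = extract_file_url(self.rfile.read(length))
        print(f"OCR finished, searchable PDF at: {file_url}")
        self.send_response(200)
        self.end_headers()

# To run standalone (in production this would live behind your web framework):
# from http.server import HTTPServer
# HTTPServer(("", 8000), OcrWebhookHandler).serve_forever()
```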

Next Steps

Now that your scanned PDFs are searchable, you can:

  • Search for text: Use the Search endpoint to find specific terms across your documents.
  • Extract text for AI: Use the Content Read endpoint to extract the OCR'd text for RAG pipelines or LLM processing.
Ready to build?
Get Started for Free