Convert Scanned PDFs to Searchable Documents with Python

Scanned PDFs are everywhere: old contracts, signed forms, receipts, or documents from legacy systems. The problem? They're just images trapped inside a PDF wrapper. You can't search them, copy text, or feed them into your data pipelines.

Running OCR (Optical Character Recognition) locally is painful. You need to install multiple tools, deal with language packs, and write custom image preprocessing code. It can work, but it takes time to set up, and the results vary with scan quality.

A much easier approach: send the scanned PDF to an API and get back a searchable PDF with an invisible text layer. The original look is preserved, but now you can select, copy, and search the text.

Quick Example

Here's how simple it is with Python and the aPDF.io OCR API:

import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/scanned-document.pdf"
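    },
    timeout=120,  # OCR can take a while; fail instead of hanging forever
)

# Stop here with a clear HTTP error instead of a confusing KeyError below
response.raise_for_status()
result = response.json()
print(f"Searchable PDF: {result['file']}")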

That's it. The API returns a URL to the new PDF with embedded text.
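
Since the response only contains a URL, you will probably want to save the converted file somewhere. Here is a minimal sketch of that step using only the standard library. It assumes the returned URL can be fetched with a plain, unauthenticated GET request, which is not confirmed here. The names `local_name_for` and `save_searchable_pdf` are made up for this example.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name_for(pdf_url, fallback="searchable.pdf"):
    """Derive a local filename from the last path segment of a URL."""
    name = Path(urlparse(pdf_url).path).name
    return name if name.lower().endswith(".pdf") else fallback


def save_searchable_pdf(pdf_url, dest_dir=".", opener=urllib.request.urlopen):
    """Stream the PDF at pdf_url into dest_dir and return the saved path.

    Assumption: the URL is fetchable with an unauthenticated GET.
    """
    dest = Path(dest_dir) / local_name_for(pdf_url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener(pdf_url) as remote, open(dest, "wb") as local:
        shutil.copyfileobj(remote, local)  # copies in chunks, not all at once
    return dest
```

With the quick example above, the call would be `save_searchable_pdf(result["file"], dest_dir="searchable")`. The `opener` parameter exists only so the download step can be swapped out, for instance in tests.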

Real-World Scenario: Digitizing a Paper Archive

Imagine you're building a document management system for a law firm. They have thousands of scanned case files from the 2000s. Lawyers need to search for specific terms like client names or case numbers.

Here's a Python script that processes a batch of scanned PDFs and converts them to searchable documents:

import requests
import time

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"

# List of scanned PDFs to process
scanned_files = [
    "https://your-storage.com/case-2001-smith.pdf",
    "https://your-storage.com/case-2002-jones.pdf",
    "https://your-storage.com/case-2003-wilson.pdf"
]

def convert_to_searchable(file_url):
    """Convert a scanned PDF to searchable PDF using OCR"""
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
            "Content-Type": "application/json"
        },
        json={"file": file_url},
        timeout=120,  # do not let one stuck request stall the whole batch
    )

    if response.status_code == 200:
        return response.json()

    print(f"Error processing {file_url} (HTTP {response.status_code}): {response.text}")
    return None

# Process all files, counting successes so the summary is accurate
converted = 0
for file_url in scanned_files:
    print(f"Processing: {file_url}")
    result = convert_to_searchable(file_url)

    if result:
        converted += 1
        print(f"  -> Searchable PDF: {result['file']}")
        print(f"  -> Pages: {result['pages']}, Size: {result['size']} bytes")

    # Small delay to avoid rate limiting
    time.sleep(1)

print(f"\nDone! {converted} of {len(scanned_files)} PDFs are now searchable.")
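
A fixed one-second pause is a simple guard against rate limits. For a larger archive, retrying failed conversions with a growing delay may hold up better. The sketch below shows one way to do that. `retry_with_backoff` is a hypothetical helper, not part of any library. It wraps any function that returns None on failure, like `convert_to_searchable` above. How the API signals rate limiting is not covered here, so this treats every None result as worth retrying.

```python
import time


def retry_with_backoff(func, *args, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(*args) until it returns a non-None result.

    Waits base_delay, then 2x, then 4x that many seconds between tries.
    Returns None if every attempt fails.
    """
    for attempt in range(attempts):
        result = func(*args)
        if result is not None:
            return result
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(base_delay * 2 ** attempt)
    return None
```

Inside the batch loop, the call would become `result = retry_with_backoff(convert_to_searchable, file_url)`.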

What Happens Behind the Scenes

When you call the OCR convert endpoint:

  1. The API downloads your scanned PDF
  2. Each page is analyzed using OCR to extract text
  3. An invisible text layer is added on top of the original image
  4. You get back a new PDF that looks identical but is fully searchable

The scanned page images are left as they are, so the original layout and appearance are preserved. The only difference is that you can now select text with your mouse and use Ctrl+F to search.

Handling Large Documents with Async Processing

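For large scanned documents (100+ pages), set the async parameter so the request returns a job ID right away, then poll the job status endpoint until the job finishes. This avoids HTTP timeouts on long-running conversions: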

import requests
import time

API_TOKEN = "YOUR_API_TOKEN"
OCR_URL = "https://apdf.io/api/pdf/ocr/convert"
STATUS_URL = "https://apdf.io/api/job/status/check"

# Start async OCR job
response = requests.post(
    OCR_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json"
    },
    json={
        "file": "https://example.com/large-scanned-document.pdf",
        "async": 1
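    },
    timeout=30,
)

# Fail early on an HTTP error instead of hitting a KeyError on "job_id"
response.raise_for_status()
job_id = response.json()["job_id"]
print(f"Job started: {job_id}")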

# Poll for completion
while True:
    status_response = requests.post(
        STATUS_URL,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
            "Content-Type": "application/json"
        },
        json={"job_id": job_id},
        timeout=30,
    )
    status_response.raise_for_status()

    status = status_response.json()

    if status.get("status") == "completed":
        print(f"Done! Searchable PDF: {status['result']['file']}")
        break
    elif status.get("status") == "failed":
        print(f"Job failed: {status.get('error')}")
        break
    else:
        print("Still processing...")
        time.sleep(5)
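
One caveat with the loop above: if a job never reaches "completed" or "failed", it polls forever. A bounded version could look like the sketch below. `wait_for_job` is a hypothetical helper, and the 30-minute default is an arbitrary assumption. It takes any zero-argument function that returns the status dictionary, so the HTTP call stays separate from the waiting logic. The status values are the ones used in the example above.

```python
import time


def wait_for_job(check_status, timeout=1800, interval=5,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until the job completes, fails, or times out.

    check_status is any zero-argument callable returning the status dict.
    Returns the final dict; raises RuntimeError on failure and
    TimeoutError if the next poll would land after the deadline.
    """
    deadline = clock() + timeout
    while True:
        status = check_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"Job failed: {status.get('error')}")
        if clock() + interval > deadline:
            raise TimeoutError(f"Job not finished after {timeout} seconds")
        sleep(interval)
```

To use it, wrap the status request from the loop above in a small function and pass that function in as `check_status`.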

Next Steps

Now that your scanned PDFs are searchable, you can:

  • Search for text: Use the Search endpoint to find specific terms across your documents.
  • Extract text for AI: Use the Content Read endpoint to extract the OCR'd text for RAG pipelines or LLM processing.