Convert Scanned PDFs to Searchable Documents with Python
Scanned PDFs are everywhere: old contracts, signed forms, receipts, or documents from legacy systems. The problem? They're just images trapped inside a PDF wrapper. You can't search them, copy text, or feed them into your data pipelines.
Running OCR (Optical Character Recognition) locally is painful. You need to install multiple tools, deal with language packs, and write custom image preprocessing code. It works, but it's slow and unreliable.
A much easier approach: send the scanned PDF to an API and get back a searchable PDF with an invisible text layer. The original look is preserved, but now you can select, copy, and search the text.
Quick Example
import time
import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"
STATUS_URL = "https://apdf.io/api/job/status/check"

# Helper: poll until the async job finishes, then return its result.
def wait_for_job(job_id, max_attempts=1200):
    for _ in range(max_attempts):
        check = requests.post(
            STATUS_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            data={"id": job_id},
        )
        check.raise_for_status()
        body = check.json()
        if body["status"] == "successful":
            return body["result"]
        if body["status"] == "failed":
            raise RuntimeError(body.get("error") or "Job failed")
        time.sleep(2)
    raise TimeoutError("Job did not finish in time")

response = requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    },
    json={"file": "https://example.com/scanned-document.pdf"},
)
response.raise_for_status()

job_id = response.json()["job_id"]
result = wait_for_job(job_id)
print(f"Searchable PDF: {result['file']}")
That's it. The API returns a URL to the new PDF with embedded text.
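Fetching the finished file is an ordinary HTTP download. A minimal sketch for saving it locally (the `result['file']` field comes from the example above; the filename-derivation logic is an illustration, not part of the API):

```python
import os
import requests
from urllib.parse import urlparse

def local_name(file_url, fallback="output.pdf"):
    """Derive a local filename from the download URL's path."""
    name = os.path.basename(urlparse(file_url).path)
    return name or fallback

def download_pdf(file_url, dest_dir="."):
    """Stream the searchable PDF to disk and return the saved path."""
    path = os.path.join(dest_dir, local_name(file_url))
    with requests.get(file_url, stream=True) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)
    return path
```

Streaming with `iter_content` keeps memory flat even for large multi-hundred-page scans.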
Real-World Scenario: Digitizing a Paper Archive
Imagine you're building a document management system for a law firm. They have thousands of scanned case files from the 2000s. Lawyers need to search for specific terms like client names or case numbers.
Here's a Python script that processes a batch of scanned PDFs and converts them to searchable documents:
import time
import requests

API_TOKEN = "YOUR_API_TOKEN"
API_URL = "https://apdf.io/api/pdf/ocr/convert"
STATUS_URL = "https://apdf.io/api/job/status/check"

# List of scanned PDFs to process
scanned_files = [
    "https://your-storage.com/case-2001-smith.pdf",
    "https://your-storage.com/case-2002-jones.pdf",
    "https://your-storage.com/case-2003-wilson.pdf",
]

# Helper: poll until the async job finishes, then return its result.
def wait_for_job(job_id, max_attempts=1200):
    for _ in range(max_attempts):
        check = requests.post(
            STATUS_URL,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            data={"id": job_id},
        )
        check.raise_for_status()
        body = check.json()
        if body["status"] == "successful":
            return body["result"]
        if body["status"] == "failed":
            raise RuntimeError(body.get("error") or "Job failed")
        time.sleep(2)
    raise TimeoutError("Job did not finish in time")

def convert_to_searchable(file_url):
    """Convert a scanned PDF to a searchable PDF using OCR."""
    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        json={"file": file_url},
    )
    if response.status_code != 200:
        print(f"Error processing {file_url}: {response.text}")
        return None
    try:
        return wait_for_job(response.json()["job_id"])
    except RuntimeError as e:
        print(f"Job failed for {file_url}: {e}")
        return None

# Process all files
for file_url in scanned_files:
    print(f"Processing: {file_url}")
    result = convert_to_searchable(file_url)
    if result:
        print(f"  -> Searchable PDF: {result['file']}")
        print(f"  -> Pages: {result['pages']}, Size: {result['size']} bytes")

print("\nDone! All PDFs are now searchable.")
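Each conversion spends most of its time waiting on the API, so a sequential loop scales poorly across thousands of case files. A generic sketch of running conversions concurrently with a thread pool (the worker function, such as `convert_to_searchable` above, is passed in; `max_workers=4` is an arbitrary choice, so check your plan's rate limits before raising it):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def convert_batch(file_urls, convert_fn, max_workers=4):
    """Run convert_fn over file_urls in parallel.

    Returns a dict mapping each URL to its result, or None if that
    conversion raised, so one bad scan does not sink the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_fn, url): url for url in file_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                print(f"Error processing {url}: {exc}")
                results[url] = None
    return results
```

Threads (rather than processes) fit here because the work is I/O-bound: each worker is blocked on HTTP requests, not CPU.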
What Happens Behind the Scenes
When you call the OCR convert endpoint:
- The API downloads your scanned PDF
- Each page is analyzed using OCR to extract text
- An invisible text layer is added on top of the original image
- You get back a new PDF that looks identical but is fully searchable
The original layout, fonts, and images are preserved. The only difference is that you can now select text with your mouse and use Ctrl+F to search.
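Once the text layer exists, any PDF text extractor can read it, which doubles as a sanity check that OCR actually worked. A sketch using `pypdf` (an assumption: `pip install pypdf`; any extraction library would do) plus a tiny Ctrl+F-style helper:

```python
def extract_pages(pdf_path):
    """Read the embedded text layer, returning one string per page."""
    from pypdf import PdfReader  # assumed installed: pip install pypdf
    return [page.extract_text() or "" for page in PdfReader(pdf_path).pages]

def find_term(pages_text, term):
    """Return 1-based page numbers whose text contains term, case-insensitively."""
    needle = term.lower()
    return [i for i, text in enumerate(pages_text, start=1) if needle in text.lower()]

# Usage sketch (hypothetical file and search term):
#   pages = extract_pages("case-2001-smith.pdf")
#   find_term(pages, "Smith")
```

If `extract_pages` returns only empty strings for a converted file, the OCR layer is missing and the job is worth re-checking.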
Skip Polling with Webhooks
The OCR endpoint always runs in the background. The examples above poll until the job is done.
If you'd rather not poll (for example, in a serverless function or a long-running batch), pass a webhook_url in the original request and aPDF.io will POST the same result payload to that URL once the job completes:
requests.post(
    API_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    },
    json={
        "file": "https://example.com/large-scanned-document.pdf",
        "webhook_url": "https://your-app.com/webhooks/ocr-complete",
    },
)
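On the receiving side, the webhook is just a JSON POST. Assuming the body carries the same keys as the polling result above (`file`, `pages`, `size`) — verify this against an actual delivery before relying on it — a framework-agnostic parsing sketch:

```python
import json

def handle_ocr_webhook(raw_body):
    """Extract the searchable-PDF URL from a webhook POST body.

    Assumes the payload mirrors the polling result's keys; adjust
    after inspecting a real delivery from aPDF.io.
    """
    payload = json.loads(raw_body)
    return payload["file"]

# In a Flask/FastAPI route you would call something like:
#   url = handle_ocr_webhook(request.data)
# and then download or index the finished PDF.
```

Respond to the webhook with a 2xx status quickly and do heavy work (downloading, indexing) asynchronously, since webhook senders typically retry on slow or failing responses.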
Next Steps
Now that your scanned PDFs are searchable, you can:
- Search for text: Use the Search endpoint to find specific terms across your documents.
- Extract text for AI: Use the Content Read endpoint to extract the OCR'd text for RAG pipelines or LLM processing.