Search Inside Scanned Contracts and Documents with PHP
Your legal team has a filing cabinet full of scanned contracts from the past decade. A client dispute comes in, and someone needs to find every contract mentioning "indemnification clause" or a specific company name. Manually opening hundreds of PDFs isn't an option.
The challenge with scanned documents is that they're just images. Standard PDF search won't work because there's no actual text layer. You need OCR (Optical Character Recognition) to read the text first.
The aPDF.io OCR Search API handles this in one step: send a scanned PDF and a search term, and get back every match with page numbers and context. No preprocessing, no separate OCR step.
Quick Example
<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/ocr/search';

// The scanned PDF to search and the term to look for
$data = [
    'file' => 'https://example.com/scanned-contract.pdf',
    'text' => 'indemnification'
];

$ch = curl_init($apiUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiToken,
        'Content-Type: application/json',
        'Accept: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode($data)
]);

$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Stop early if the request failed or the API returned a non-200 status
if ($response === false || $httpCode !== 200) {
    exit("Request failed (HTTP $httpCode)\n");
}

$result = json_decode($response, true);

echo "Found {$result['results_total']} matches across {$result['results_pages']} pages\n";
foreach ($result['results'] as $match) {
    echo "Page {$match['page']}: {$match['matched_line']}\n";
}
Understanding the Response
The API returns structured search results:
{
    "search_text": "indemnification",
    "results_total": 3,
    "results_pages": 2,
    "results": [
        {
            "page": "4",
            "matched_line": "Section 8.2: Indemnification. The Contractor shall indemnify and hold harmless...",
            "exact_word": "Indemnification"
        },
        {
            "page": "4",
            "matched_line": "...subject to the indemnification provisions outlined in Section 8.2.",
            "exact_word": "indemnification"
        },
        {
            "page": "12",
            "matched_line": "See Indemnification Clause (Exhibit B) for additional terms.",
            "exact_word": "Indemnification"
        }
    ]
}
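Each result includes the page number, the full line where the match was found, and the matched word exactly as it appears in the document, giving you context without needing to open the PDF. Note that in this sample the page number arrives as a string.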
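If you want to show results page by page rather than as a flat list, the response can be regrouped locally. The following is a minimal sketch that assumes the response shape shown above; `groupMatchesByPage` is a hypothetical helper name, not part of the API, and the sample data is shortened from the response above.

```php
<?php
// Hypothetical helper: group the API's matches by page number.
// Assumes each entry in "results" has "page" and "matched_line" keys,
// as in the sample response above.
function groupMatchesByPage(array $result): array {
    $byPage = [];
    foreach ($result['results'] ?? [] as $match) {
        // "page" is a string in the sample response, so cast it to int
        $page = (int) $match['page'];
        $byPage[$page][] = $match['matched_line'];
    }
    ksort($byPage); // order pages numerically
    return $byPage;
}

// Sample data shortened from the response above
$sample = [
    'results_total' => 3,
    'results_pages' => 2,
    'results' => [
        ['page' => '4', 'matched_line' => 'Section 8.2: Indemnification. ...', 'exact_word' => 'Indemnification'],
        ['page' => '4', 'matched_line' => '...subject to the indemnification provisions...', 'exact_word' => 'indemnification'],
        ['page' => '12', 'matched_line' => 'See Indemnification Clause (Exhibit B)...', 'exact_word' => 'Indemnification'],
    ],
];

foreach (groupMatchesByPage($sample) as $page => $lines) {
    echo "Page $page (" . count($lines) . " match(es))\n";
    foreach ($lines as $line) {
        echo "  $line\n";
    }
}
```

Casting the page to an integer keeps the pages in numeric order; sorting the raw strings would put "12" before "4".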
Real-World Scenario: Contract Compliance Checker
You're building a compliance tool for a law firm. Before signing new contracts, they need to verify that certain required clauses are present. Here's a PHP script that checks a contract for each required clause:
<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/ocr/search';

// Required clauses that must be present in every contract
$requiredClauses = [
    'indemnification',
    'limitation of liability',
    'confidentiality',
    'termination'
];

// Search one PDF for one term; returns the decoded response or ['error' => ...]
function searchInPdf($pdfUrl, $searchText, $apiToken, $apiUrl) {
    $data = [
        'file' => $pdfUrl,
        'text' => $searchText
    ];

    $ch = curl_init($apiUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'Authorization: Bearer ' . $apiToken,
            'Content-Type: application/json',
            'Accept: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode($data)
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // curl_exec() returns false on transport errors (DNS, timeout, ...)
    if ($response === false || $httpCode !== 200) {
        return ['error' => 'API request failed'];
    }

    $result = json_decode($response, true);
    if (!is_array($result)) {
        return ['error' => 'Unexpected response format'];
    }
    return $result;
}

function checkContractCompliance($contractUrl, $requiredClauses, $apiToken, $apiUrl) {
    echo "Checking contract: $contractUrl\n";
    echo str_repeat('-', 50) . "\n";

    $missing = [];
    $found = [];
    $failed = []; // clauses whose search could not be completed

    foreach ($requiredClauses as $clause) {
        $result = searchInPdf($contractUrl, $clause, $apiToken, $apiUrl);

        if (isset($result['error'])) {
            // A failed search is not evidence that the clause is present
            $failed[] = $clause;
            echo "Error searching for '$clause': {$result['error']}\n";
        } elseif ($result['results_total'] > 0) {
            $found[] = [
                'clause' => $clause,
                'count' => $result['results_total'],
                'pages' => $result['results_pages']
            ];
            echo "✓ Found '$clause' - {$result['results_total']} mentions on {$result['results_pages']} page(s)\n";
        } else {
            $missing[] = $clause;
            echo "✗ Missing '$clause'\n";
        }

        // Rate limiting: pause between requests, including after errors
        usleep(500000);
    }

    echo "\n";

    if (!empty($failed)) {
        echo "INCOMPLETE: Could not check: " . implode(', ', $failed) . "\n";
    }
    if (!empty($missing)) {
        echo "FAILED: Missing clauses: " . implode(', ', $missing) . "\n";
    }
    if (empty($failed) && empty($missing)) {
        echo "PASSED: All required clauses found.\n";
        return true;
    }
    return false;
}

// Check a contract
$contractUrl = 'https://your-storage.com/contracts/vendor-agreement-2024.pdf';
$isCompliant = checkContractCompliance($contractUrl, $requiredClauses, $apiToken, $apiUrl);
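The script above pauses a fixed 500 ms between requests and gives up on the first failed call. If transient failures are a concern, one option is to retry with a growing delay. The sketch below is an illustration only: `withRetries` is a hypothetical helper, it treats any result containing an `error` key as retryable, and whether a given failure from the API is actually worth retrying is an assumption you should check against the API's documentation. The demo uses a stubbed call and a stubbed sleep so it runs without network access.

```php
<?php
// Hypothetical helper: call $attempt until it returns a result without an
// 'error' key, doubling the wait after each failure.
// $sleep is injectable so the delay can be stubbed out in tests.
function withRetries(callable $attempt, int $maxTries = 3, int $baseDelayMs = 500, ?callable $sleep = null): array {
    if ($sleep === null) {
        $sleep = function (int $ms): void {
            usleep($ms * 1000);
        };
    }
    $result = ['error' => 'No attempts made'];
    for ($try = 1; $try <= $maxTries; $try++) {
        $result = $attempt();
        if (!isset($result['error'])) {
            return $result;
        }
        if ($try < $maxTries) {
            // Wait 500 ms, then 1000 ms, then 2000 ms, ...
            $sleep($baseDelayMs * (2 ** ($try - 1)));
        }
    }
    return $result; // still an error after the last attempt
}

// Demo with a stub that fails twice and then succeeds
$calls = 0;
$delays = [];
$flaky = function () use (&$calls) {
    $calls++;
    return $calls < 3 ? ['error' => 'API request failed'] : ['results_total' => 1];
};
$recordDelay = function (int $ms) use (&$delays): void {
    $delays[] = $ms;
};

$result = withRetries($flaky, 3, 500, $recordDelay);
echo "Succeeded after $calls attempts\n";
```

With the real function, the call could be wrapped as `withRetries(function () use ($contractUrl, $clause, $apiToken, $apiUrl) { return searchInPdf($contractUrl, $clause, $apiToken, $apiUrl); })`.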
Case-Sensitive and Regex Search
For more precise searches, you can enable case sensitivity or use regular expressions:
<?php
$pdfUrl = 'https://example.com/scanned-contract.pdf';

// The three payloads below are alternatives; send one per request.

// Case-sensitive search (finds "LLC" but not "llc")
$data = [
    'file' => $pdfUrl,
    'text' => 'LLC',
    'case' => 1
];

// Regex search (finds dates like "2024-01-15" or "2023-12-31")
// In a single-quoted PHP string, '\\d' is sent to the API as \d
$data = [
    'file' => $pdfUrl,
    'text' => '\\d{4}-\\d{2}-\\d{2}',
    'regex' => 1
];

// Regex for dollar amounts (finds "$1,000.00", "$50,000", etc.)
$data = [
    'file' => $pdfUrl,
    'text' => '\\$[\\d,]+(\\.\\d{2})?',
    'regex' => 1
];
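Before spending API calls on a pattern, it can help to try it against a line of sample text locally. The sketch below does that with PHP's own PCRE functions. It assumes the API's regex dialect behaves like PCRE for these simple patterns, which the text above does not confirm; `findLocally` and the sample sentence are made up for illustration.

```php
<?php
// The same patterns as above, written without delimiters
$patterns = [
    'date'   => '\\d{4}-\\d{2}-\\d{2}',
    'amount' => '\\$[\\d,]+(\\.\\d{2})?',
];

// Hypothetical helper: run a bare pattern against sample text using PCRE.
// The pattern is wrapped in "~" delimiters, so it must not contain "~".
function findLocally(string $pattern, string $text): array {
    $found = preg_match_all('~' . $pattern . '~', $text, $m);
    if ($found === false) {
        return []; // invalid pattern
    }
    return $m[0]; // $m[0] holds the full matches
}

// Made-up sample line containing two amounts and two dates
$sampleLine = 'Payment of $50,000 is due on 2024-01-15; a late fee of $1,000.00 applies after 2023-12-31.';

foreach ($patterns as $name => $pattern) {
    echo $name . ': ' . implode(' | ', findLocally($pattern, $sampleLine)) . "\n";
}
// date: 2024-01-15 | 2023-12-31
// amount: $50,000 | $1,000.00
```

A local check only validates the pattern itself. OCR output may differ from the printed text (for example a misread digit), so a pattern that works on clean sample text can still miss matches in a scan.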
Batch Search Across Multiple Documents
When you need to find a specific term across an entire archive:
<?php
// Reuses searchInPdf(), $apiToken, and $apiUrl from the compliance checker above

function searchAcrossArchive($documents, $searchTerm, $apiToken, $apiUrl) {
    $matches = [];

    foreach ($documents as $doc) {
        echo "Searching: {$doc['name']}... ";
        $result = searchInPdf($doc['url'], $searchTerm, $apiToken, $apiUrl);

        if (isset($result['error'])) {
            // Report failures separately so they are not mistaken for "no matches"
            echo "Error: {$result['error']}\n";
        } elseif (isset($result['results_total']) && $result['results_total'] > 0) {
            echo "Found {$result['results_total']} matches\n";
            $matches[] = [
                'document' => $doc['name'],
                'url' => $doc['url'],
                'total_matches' => $result['results_total'],
                'pages' => $result['results_pages'],
                'details' => $result['results']
            ];
        } else {
            echo "No matches\n";
        }

        usleep(500000); // Rate limiting
    }

    return $matches;
}

// Archive of scanned contracts
$archive = [
    ['name' => 'Contract-2020-001', 'url' => 'https://storage.example.com/2020-001.pdf'],
    ['name' => 'Contract-2020-002', 'url' => 'https://storage.example.com/2020-002.pdf'],
    ['name' => 'Contract-2021-001', 'url' => 'https://storage.example.com/2021-001.pdf'],
];

// Find all contracts mentioning a specific company
$searchTerm = 'Acme Corporation';
$matches = searchAcrossArchive($archive, $searchTerm, $apiToken, $apiUrl);

echo "\n=== Summary ===\n";
echo "Found '$searchTerm' in " . count($matches) . " document(s)\n";
foreach ($matches as $match) {
    echo "- {$match['document']}: {$match['total_matches']} matches on {$match['pages']} page(s)\n";
}
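The archive results are easier to hand to a legal team as a spreadsheet than as console output. The following sketch flattens the matches into one row per hit and renders them as CSV. It assumes the array structure built by `searchAcrossArchive()` above; `flattenMatches`, `rowsToCsv`, and the sample data are hypothetical and exist only for illustration.

```php
<?php
// Hypothetical helper: turn archive matches into one row per hit
// (document name, page number, matched line).
function flattenMatches(array $matches): array {
    $rows = [];
    foreach ($matches as $match) {
        foreach ($match['details'] as $hit) {
            $rows[] = [$match['document'], (int) $hit['page'], $hit['matched_line']];
        }
    }
    return $rows;
}

// Hypothetical helper: render rows as CSV text with a header line.
function rowsToCsv(array $rows): string {
    $fh = fopen('php://temp', 'r+');
    // Separator, enclosure, and escape are passed explicitly
    fputcsv($fh, ['document', 'page', 'matched_line'], ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($fh, $row, ',', '"', '\\');
    }
    rewind($fh);
    $csv = stream_get_contents($fh);
    fclose($fh);
    return $csv;
}

// Made-up sample in the shape returned by searchAcrossArchive()
$sampleMatches = [[
    'document' => 'Contract-2020-001',
    'url' => 'https://storage.example.com/2020-001.pdf',
    'total_matches' => 2,
    'pages' => 2,
    'details' => [
        ['page' => '3', 'matched_line' => 'This Agreement is between Acme Corporation and the Vendor.', 'exact_word' => 'Acme Corporation'],
        ['page' => '7', 'matched_line' => 'Notices to Acme Corporation, attention: Legal.', 'exact_word' => 'Acme Corporation'],
    ],
]];

$rows = flattenMatches($sampleMatches);
echo rowsToCsv($rows);
```

To save the report instead of printing it, the returned string can be written with `file_put_contents('report.csv', rowsToCsv($rows))`.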
Next Steps
Now that you can search scanned documents, consider these related features:
- Extract full text: Use the OCR Read endpoint to extract all text from a scanned document for indexing or analysis.
- Make PDFs searchable: Use the OCR Convert endpoint to add a text layer to scanned PDFs, making them searchable in any PDF reader.