Search Inside Scanned Contracts and Documents with PHP

Your legal team has a filing cabinet full of scanned contracts from the past decade. A client dispute comes in, and someone needs to find every contract mentioning "indemnification clause" or a specific company name. Manually opening hundreds of PDFs isn't an option.

The challenge with scanned documents is that they're just images. Standard PDF search won't work because there's no actual text layer. You need OCR (Optical Character Recognition) to read the text first.

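The aPDF.io OCR Search API handles this with a single request: send a scanned PDF and a search term, and get back every match with its page number and the line it appears on. There is no preprocessing and no separate OCR step. The search runs as an asynchronous job, so the request returns a job ID that you poll until the result is ready.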

Quick Example

Here's how to search a scanned PDF with PHP:

<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/ocr/search';
$statusUrl = 'https://apdf.io/api/job/status/check';

// Helper: poll until the async job finishes, then return its result.
function wait_for_job(string $jobId, string $apiToken, string $statusUrl) {
    for ($i = 0; $i < 1200; $i++) {
        $ch = curl_init($statusUrl);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_HTTPHEADER => ['Authorization: Bearer ' . $apiToken],
            CURLOPT_POSTFIELDS => http_build_query(['id' => $jobId]),
        ]);
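        $raw = curl_exec($ch);
        curl_close($ch);

        // A transport failure or non-JSON body leaves $status null, so the loop simply retries.
        $body = is_string($raw) ? json_decode($raw, true) : null;
        $status = is_array($body) ? ($body['status'] ?? null) : null;

        if ($status === 'successful') return $body['result'];
        if ($status === 'failed') throw new RuntimeException($body['error'] ?? 'Job failed');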
        sleep(2);
    }
    throw new RuntimeException('Job did not finish in time');
}

$data = [
    'file' => 'https://example.com/scanned-contract.pdf',
    'text' => 'indemnification'
];

$ch = curl_init($apiUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiToken,
        'Content-Type: application/json',
        'Accept: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode($data)
]);

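$response = curl_exec($ch);
curl_close($ch);

// Stop early if the request failed or the response carries no job ID.
$jobId = is_string($response) ? (json_decode($response, true)['job_id'] ?? null) : null;
if ($jobId === null) {
    exit("Could not start the search job\n");
}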

$result = wait_for_job($jobId, $apiToken, $statusUrl);

echo "Found {$result['results_total']} matches across {$result['results_pages']} pages\n";

foreach ($result['results'] as $match) {
    echo "Page {$match['page']}: {$match['matched_line']}\n";
}
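
The polling helper above sleeps a fixed 2 seconds for up to 1200 attempts, which puts the ceiling at roughly 40 minutes. If that is longer than you want to wait, one option is to grow the delay between checks and cap the total wait. The following is a minimal sketch of that idea. The function backoff_delays() and its parameters are hypothetical names, not part of the aPDF.io API, and the numbers are only illustrative.

```php
<?php
// Hypothetical helper: build a list of sleep intervals (in seconds) that
// double each time up to $maxDelay and stop before the total exceeds $budget.
function backoff_delays(int $firstDelay, int $maxDelay, int $budget): array {
    // Guard against zero or negative delays, which would never use up the budget.
    $delay = max(1, $firstDelay);
    $maxDelay = max(1, $maxDelay);

    $delays = [];
    $total = 0;

    while ($total + $delay <= $budget) {
        $delays[] = $delay;
        $total += $delay;
        $delay = min($delay * 2, $maxDelay);
    }

    return $delays;
}

// Start at 1s, cap each wait at 8s, give up after at most 30s in total.
$delays = backoff_delays(1, 8, 30);
echo implode(', ', $delays) . "\n"; // prints "1, 2, 4, 8, 8"
```

Inside wait_for_job() you could loop over these values with foreach and call sleep($delay) after each status check, in place of the fixed sleep(2).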

Understanding the Response

Once the job succeeds, its result holds the structured search results:

{
  "search_text": "indemnification",
  "results_total": 3,
  "results_pages": 2,
  "results": [
    {
      "page": "4",
      "matched_line": "Section 8.2: Indemnification. The Contractor shall indemnify and hold harmless...",
      "exact_word": "Indemnification"
    },
    {
      "page": "4",
      "matched_line": "...subject to the indemnification provisions outlined in Section 8.2.",
      "exact_word": "indemnification"
    },
    {
      "page": "12",
      "matched_line": "See Indemnification Clause (Exhibit B) for additional terms.",
      "exact_word": "Indemnification"
    }
  ]
}

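Each result includes the page number, the full line where the match was found ("matched_line"), and the matched text as it appears in the document ("exact_word"), giving you context without needing to open the PDF. In this sample, "page" is a string, so cast it to an integer before sorting or comparing page numbers.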
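
To show how these fields can be used, here is a small sketch that groups the matched lines by page. It works on a hard-coded array shaped like the sample response rather than a live API call, and group_matches_by_page() is a hypothetical helper name.

```php
<?php
// Sample data copied from the example response (not a live API call).
$result = [
    'search_text' => 'indemnification',
    'results_total' => 3,
    'results_pages' => 2,
    'results' => [
        ['page' => '4', 'matched_line' => 'Section 8.2: Indemnification. The Contractor shall indemnify and hold harmless...', 'exact_word' => 'Indemnification'],
        ['page' => '4', 'matched_line' => '...subject to the indemnification provisions outlined in Section 8.2.', 'exact_word' => 'indemnification'],
        ['page' => '12', 'matched_line' => 'See Indemnification Clause (Exhibit B) for additional terms.', 'exact_word' => 'Indemnification'],
    ],
];

// Hypothetical helper: collect matched lines under their integer page number.
function group_matches_by_page(array $results): array {
    $byPage = [];
    foreach ($results as $match) {
        $page = (int) $match['page']; // "page" is a string in the sample response
        $byPage[$page][] = $match['matched_line'];
    }
    ksort($byPage); // integer keys sort numerically, so page 4 comes before page 12
    return $byPage;
}

$byPage = group_matches_by_page($result['results']);

foreach ($byPage as $page => $lines) {
    echo "Page $page (" . count($lines) . " match(es))\n";
    foreach ($lines as $line) {
        echo "  - $line\n";
    }
}
```

Casting the page to an integer before sorting avoids the string ordering in which "12" would come before "4".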

Real-World Scenario: Contract Compliance Checker

You're building a compliance tool for a law firm. Before signing new contracts, they need to verify that certain required clauses are present. Here's a PHP script that checks multiple contracts:

<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/ocr/search';
$statusUrl = 'https://apdf.io/api/job/status/check';

// Required clauses that must be present in every contract
$requiredClauses = [
    'indemnification',
    'limitation of liability',
    'confidentiality',
    'termination'
];

// Helper: poll until the async job finishes, then return its result.
function wait_for_job(string $jobId, string $apiToken, string $statusUrl) {
    for ($i = 0; $i < 1200; $i++) {
        $ch = curl_init($statusUrl);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_HTTPHEADER => ['Authorization: Bearer ' . $apiToken],
            CURLOPT_POSTFIELDS => http_build_query(['id' => $jobId]),
        ]);
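        $raw = curl_exec($ch);
        curl_close($ch);

        // A transport failure or non-JSON body leaves $status null, so the loop simply retries.
        $body = is_string($raw) ? json_decode($raw, true) : null;
        $status = is_array($body) ? ($body['status'] ?? null) : null;

        if ($status === 'successful') return $body['result'];
        if ($status === 'failed') throw new RuntimeException($body['error'] ?? 'Job failed');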
        sleep(2);
    }
    throw new RuntimeException('Job did not finish in time');
}

function searchInPdf($pdfUrl, $searchText, $apiToken, $apiUrl, $statusUrl) {
    $data = [
        'file' => $pdfUrl,
        'text' => $searchText
    ];

    $ch = curl_init($apiUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'Authorization: Bearer ' . $apiToken,
            'Content-Type: application/json',
            'Accept: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode($data)
    ]);

    $response = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

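    // Accept any 2xx status, since this request only queues a job.
    if ($response === false || $httpCode < 200 || $httpCode >= 300) {
        return ['error' => "API request failed (HTTP $httpCode)"];
    }

    $jobId = json_decode($response, true)['job_id'] ?? null;
    if ($jobId === null) {
        return ['error' => 'API response did not include a job_id'];
    }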
    try {
        return wait_for_job($jobId, $apiToken, $statusUrl);
    } catch (RuntimeException $e) {
        return ['error' => $e->getMessage()];
    }
}

function checkContractCompliance($contractUrl, $requiredClauses, $apiToken, $apiUrl, $statusUrl) {
    echo "Checking contract: $contractUrl\n";
    echo str_repeat('-', 50) . "\n";

    $missing = [];
    $found = [];

    foreach ($requiredClauses as $clause) {
        $result = searchInPdf($contractUrl, $clause, $apiToken, $apiUrl, $statusUrl);

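        if (isset($result['error'])) {
            echo "Error searching for '$clause': {$result['error']}\n";
            // An unverified clause must not count as present; otherwise a
            // contract would pass when every search failed.
            $missing[] = "$clause (not verified)";
            continue;
        }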

        if ($result['results_total'] > 0) {
            $found[] = [
                'clause' => $clause,
                'count' => $result['results_total'],
                'pages' => $result['results_pages']
            ];
            echo "✓ Found '$clause' - {$result['results_total']} mentions on {$result['results_pages']} page(s)\n";
        } else {
            $missing[] = $clause;
            echo "✗ Missing '$clause'\n";
        }

        // Rate limiting
        usleep(500000);
    }

    echo "\n";

    if (empty($missing)) {
        echo "PASSED: All required clauses found.\n";
        return true;
    } else {
        echo "FAILED: Missing clauses: " . implode(', ', $missing) . "\n";
        return false;
    }
}

// Check a contract
$contractUrl = 'https://your-storage.com/contracts/vendor-agreement-2024.pdf';
$isCompliant = checkContractCompliance($contractUrl, $requiredClauses, $apiToken, $apiUrl, $statusUrl);

Case-Sensitive and Regex Search

For more precise searches, you can enable case sensitivity or use regular expressions:

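<?php
// Illustrative URL; each $data array below is an alternative request body.
$pdfUrl = 'https://example.com/scanned-contract.pdf';

// Case-sensitive search (finds "LLC" but not "llc")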
$data = [
    'file' => $pdfUrl,
    'text' => 'LLC',
    'case' => 1
];

// Regex search (finds dates like "2024-01-15" or "2023-12-31")
$data = [
    'file' => $pdfUrl,
    'text' => '\\d{4}-\\d{2}-\\d{2}',
    'regex' => 1
];

// Regex for dollar amounts (finds "$1,000.00", "$50,000", etc.)
$data = [
    'file' => $pdfUrl,
    'text' => '\\$[\\d,]+(\\.\\d{2})?',
    'regex' => 1
];
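
Before sending a pattern to the API, it can help to try it against a few sample strings locally. The sketch below uses PHP's own preg_match(), on the assumption that the API's regex dialect behaves like PCRE for simple patterns such as these. The article does not say which dialect the API uses, so treat a local pass as a sanity check only. The helper pattern_matches() is a hypothetical name.

```php
<?php
// Hypothetical helper: wrap a bare pattern in "~" delimiters and test it locally.
// Assumes the pattern itself contains no "~" character.
function pattern_matches(string $pattern, string $sample): bool {
    return preg_match('~' . $pattern . '~', $sample) === 1;
}

// The same pattern strings used in the request bodies above.
$datePattern = '\\d{4}-\\d{2}-\\d{2}';
$amountPattern = '\\$[\\d,]+(\\.\\d{2})?';

var_dump(pattern_matches($datePattern, 'Effective 2024-01-15'));    // bool(true)
var_dump(pattern_matches($datePattern, 'Effective Jan 15, 2024'));  // bool(false)
var_dump(pattern_matches($amountPattern, 'Total: $50,000'));        // bool(true)
var_dump(pattern_matches($amountPattern, 'Total: 50,000 USD'));     // bool(false)
```

Note that OCR output can contain misread characters, so a pattern that passes locally may still miss text in a poor-quality scan.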

Batch Search Across Multiple Documents

When you need to find a specific term across an entire archive:

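<?php
// Reuses searchInPdf() and the $apiToken, $apiUrl and $statusUrl values from the compliance script above.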
function searchAcrossArchive($documents, $searchTerm, $apiToken, $apiUrl, $statusUrl) {
    $matches = [];

    foreach ($documents as $doc) {
        echo "Searching: {$doc['name']}... ";

        $result = searchInPdf($doc['url'], $searchTerm, $apiToken, $apiUrl, $statusUrl);

        if (isset($result['results_total']) && $result['results_total'] > 0) {
            echo "Found {$result['results_total']} matches\n";
            $matches[] = [
                'document' => $doc['name'],
                'url' => $doc['url'],
                'total_matches' => $result['results_total'],
                'pages' => $result['results_pages'],
                'details' => $result['results']
            ];
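        } elseif (isset($result['error'])) {
            // Report failures separately so they are not mistaken for clean documents.
            echo "Error: {$result['error']}\n";
        } else {
            echo "No matches\n";
        }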

        usleep(500000); // Rate limiting
    }

    return $matches;
}

// Archive of scanned contracts
$archive = [
    ['name' => 'Contract-2020-001', 'url' => 'https://storage.example.com/2020-001.pdf'],
    ['name' => 'Contract-2020-002', 'url' => 'https://storage.example.com/2020-002.pdf'],
    ['name' => 'Contract-2021-001', 'url' => 'https://storage.example.com/2021-001.pdf'],
];

// Find all contracts mentioning a specific company
$searchTerm = 'Acme Corporation';
$matches = searchAcrossArchive($archive, $searchTerm, $apiToken, $apiUrl, $statusUrl);

echo "\n=== Summary ===\n";
echo "Found '$searchTerm' in " . count($matches) . " document(s)\n";

foreach ($matches as $match) {
    echo "- {$match['document']}: {$match['total_matches']} matches on {$match['pages']} page(s)\n";
}
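
If the summary needs to go to someone outside the development team, the matches can be flattened into a CSV file. The sketch below is one way to do that. It runs on made-up sample data shaped like the return value of searchAcrossArchive(), and matches_to_csv() is a hypothetical helper name.

```php
<?php
// Made-up sample shaped like the array returned by searchAcrossArchive().
$matches = [
    [
        'document' => 'Contract-2020-001',
        'url' => 'https://storage.example.com/2020-001.pdf',
        'total_matches' => 2,
        'pages' => 1,
        'details' => [
            ['page' => '3', 'matched_line' => 'Supplier: Acme Corporation, a Delaware company', 'exact_word' => 'Acme Corporation'],
            ['page' => '3', 'matched_line' => 'Signed on behalf of Acme Corporation', 'exact_word' => 'Acme Corporation'],
        ],
    ],
];

// Hypothetical helper: write one CSV row per matched line and return the CSV text.
function matches_to_csv(array $matches): string {
    $fh = fopen('php://memory', 'w+');

    // Separator, enclosure and escape are passed explicitly because newer
    // PHP versions deprecate relying on the default escape character.
    fputcsv($fh, ['document', 'page', 'matched_line'], ',', '"', '');

    foreach ($matches as $match) {
        foreach ($match['details'] as $detail) {
            fputcsv($fh, [$match['document'], $detail['page'], $detail['matched_line']], ',', '"', '');
        }
    }

    rewind($fh);
    $csv = stream_get_contents($fh);
    fclose($fh);

    return $csv;
}

$csv = matches_to_csv($matches);
echo $csv;
```

To save the report, pass the returned string to file_put_contents() with a file name of your choice.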

Next Steps

Now that you can search scanned documents, consider these related features:

Extract full text: Use the OCR Read endpoint to extract all text from a scanned document for indexing or analysis.
Make PDFs searchable: Use the OCR Convert endpoint to add a text layer to scanned PDFs, making them searchable in any PDF reader.
