Extract Invoice Pages from Bank Statements with PHP
Banks and vendors often send combined PDF statements containing multiple documents. Your monthly bank statement might include transaction summaries, attached invoices, and regulatory disclosures—all in one 20-page PDF.
For bookkeeping, you need to extract just the invoice pages (say, pages 5-8) and save them separately. Opening the PDF in Acrobat, selecting pages, and exporting is tedious when you have dozens of statements.
The aPDF.io Split API automates this. Specify the page ranges you want, and the API returns a separate PDF file for each one. No manual work, no desktop software required.
Quick Example
<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/file/split';
$data = [
    'file' => 'https://example.com/bank-statement-january.pdf',
    'pages' => '5-8' // Extract only pages 5 through 8
];
$ch = curl_init($apiUrl);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => [
        'Authorization: Bearer ' . $apiToken,
        'Content-Type: application/json',
        'Accept: application/json'
    ],
    CURLOPT_POSTFIELDS => json_encode($data)
]);
$response = curl_exec($ch);
curl_close($ch);
$result = json_decode($response, true);
echo "Extracted PDF: {$result[0]['file']}\n";
echo "Pages: {$result[0]['pages']}\n";
Understanding the Response
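The API returns an array of split files, one for each page range you specified. Note that pages in the response is the page count of the extracted file, not the range you requested: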
[
    {
        "file": "https://apdf-files.s3.eu-central-1.amazonaws.com/191674e262f952ca-5-8.pdf",
        "expiration": "2024-12-02T22:27:11.610806Z",
        "pages": 4,
        "size": 245102
    }
]
Note: The file URL is valid for 1 hour. Download and store it in your system promptly.
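Because the link expires, it can help to copy the file into your own storage as soon as the response arrives. The following is a minimal sketch, not part of the aPDF.io API: isExpired and saveSplitFile are hypothetical helper names, the code assumes the expiration field marks when the link stops working, and it assumes allow_url_fopen is enabled so that fopen can read an HTTPS URL.

```php
<?php
/**
 * Returns true when the "expiration" timestamp of a split result has passed.
 * Hypothetical helper; $now is injectable so the check can be tested.
 */
function isExpired(array $splitResult, ?DateTimeImmutable $now = null): bool
{
    $now = $now ?? new DateTimeImmutable('now', new DateTimeZone('UTC'));
    $expires = new DateTimeImmutable($splitResult['expiration']);
    return $expires <= $now;
}

/**
 * Streams the file behind $sourceUrl into $destPath and returns the bytes written.
 * Hypothetical helper; remote URLs require allow_url_fopen to be enabled.
 */
function saveSplitFile(string $sourceUrl, string $destPath): int
{
    $in = @fopen($sourceUrl, 'rb');
    if ($in === false) {
        throw new RuntimeException("Cannot open source: $sourceUrl");
    }
    $out = @fopen($destPath, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Cannot open destination: $destPath");
    }
    // Copy in chunks so large PDFs are never held fully in memory
    $bytes = stream_copy_to_stream($in, $out);
    fclose($in);
    fclose($out);
    if ($bytes === false) {
        throw new RuntimeException("Copy failed for: $sourceUrl");
    }
    return $bytes;
}
```

With the response from the quick example, saveSplitFile($result[0]['file'], $localPath) would store the extract, where $localPath is a path of your choosing. If size is reported in bytes (an assumption), comparing it with the returned byte count is a cheap integrity check.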
Real-World Scenario: Automated Invoice Extraction
You're building an expense management system. Users upload bank statements, and the system automatically extracts the invoice sections based on a predefined structure.
<?php
$apiToken = 'YOUR_API_TOKEN';
$apiUrl = 'https://apdf.io/api/pdf/file/split';
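/**
 * Split a PDF into multiple sections.
 * Returns the decoded API response: one entry per requested page range.
 */
function splitPdf($pdfUrl, $pageRanges, $apiToken, $apiUrl) {
    $data = [
        'file' => $pdfUrl,
        'pages' => $pageRanges
    ];
    $ch = curl_init($apiUrl);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => true,
        CURLOPT_HTTPHEADER => [
            'Authorization: Bearer ' . $apiToken,
            'Content-Type: application/json',
            'Accept: application/json'
        ],
        CURLOPT_POSTFIELDS => json_encode($data)
    ]);
    $response = curl_exec($ch);
    // Capture transport details before closing the handle
    $curlError = curl_error($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($response === false) {
        // No HTTP response at all (DNS failure, timeout, TLS error)
        throw new Exception("Request failed: $curlError");
    }
    if ($httpCode !== 200) {
        throw new Exception("API error (HTTP $httpCode): $response");
    }
    $result = json_decode($response, true);
    if (!is_array($result)) {
        throw new Exception("Unexpected response: $response");
    }
    return $result;
}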
/**
 * Extract specific sections from a bank statement
 */
function extractBankStatementSections($statementUrl) {
    global $apiToken, $apiUrl;
    // Define page ranges for different sections
    // Format: "range1,range2,range3" creates separate PDFs
    $pageRanges = implode(',', [
        '1-2', // Summary pages
        '3-4', // Transaction details
        '5-8', // Attached invoices
        '9-z'  // Disclosures and legal
    ]);
    $results = splitPdf($statementUrl, $pageRanges, $apiToken, $apiUrl);
    $sections = [
        'summary' => $results[0] ?? null,
        'transactions' => $results[1] ?? null,
        'invoices' => $results[2] ?? null,
        'legal' => $results[3] ?? null,
    ];
    return $sections;
}
// Process a bank statement
$statementUrl = 'https://your-storage.com/statements/january-2024.pdf';
$sections = extractBankStatementSections($statementUrl);
echo "=== Extracted Sections ===\n";
foreach ($sections as $name => $section) {
    if ($section) {
        echo "$name: {$section['pages']} pages - {$section['file']}\n";
    }
}
// Now you can:
// - Archive the invoice section to your accounting system
// - Send transactions to your bookkeeping software
// - Discard the legal disclosures
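The mapping in extractBankStatementSections relies on the API returning one result per requested range, in request order. A defensive variant can check that count before assigning names. This is a sketch only: mapSections is a hypothetical helper that takes an already-decoded response, so it makes no API call itself, and the URLs in the example are placeholders.

```php
<?php
/**
 * Pairs section names with split results by position.
 * $ranges maps section name => page range, e.g. ['summary' => '1-2'].
 * Throws if the number of results differs from the number of ranges.
 */
function mapSections(array $ranges, array $results): array
{
    if (count($results) !== count($ranges)) {
        throw new UnexpectedValueException(sprintf(
            'Expected %d split files, got %d',
            count($ranges),
            count($results)
        ));
    }
    // array_combine pairs the i-th section name with the i-th result
    return array_combine(array_keys($ranges), array_values($results));
}

// Example with a hand-written response instead of a live API call
$ranges = ['summary' => '1-2', 'invoices' => '5-8'];
$results = [
    ['file' => 'https://example.com/a.pdf', 'pages' => 2],
    ['file' => 'https://example.com/b.pdf', 'pages' => 4],
];
$sections = mapSections($ranges, $results);
echo $sections['invoices']['file'], "\n"; // https://example.com/b.pdf
```

Failing loudly on a count mismatch avoids silently filing the transactions PDF under the invoices key when a statement is shorter than expected.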
Page Selection Syntax
The Split API accepts several page selection patterns. Commas separate selections, and each comma-separated selection produces its own PDF:
<?php
// Single page
$data = ['file' => $url, 'pages' => '5'];
// Result: 1 PDF with just page 5
// Page range
$data = ['file' => $url, 'pages' => '5-8'];
// Result: 1 PDF with pages 5, 6, 7, 8
// Multiple specific pages (creates separate PDFs)
$data = ['file' => $url, 'pages' => '1,5,10'];
// Result: 3 PDFs (page 1, page 5, page 10)
// Multiple ranges (creates separate PDFs)
$data = ['file' => $url, 'pages' => '1-3,5-8,10-z'];
// Result: 3 PDFs (pages 1-3, pages 5-8, pages 10 to end)
// From page X to end
$data = ['file' => $url, 'pages' => '5-z'];
// Result: 1 PDF from page 5 to the last page
// Last 3 pages (reverse indexing: r1 is the last page, r3 the third from last)
$data = ['file' => $url, 'pages' => 'r3-r1'];
// Result: 1 PDF with the last 3 pages
// Split every N pages
$data = ['file' => $url, 'pages' => 'n3'];
// Result: Multiple PDFs, each with 3 pages
// 12-page doc -> 4 PDFs of 3 pages each
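When page numbers come from variables rather than literals, small builder functions can keep the selection strings consistent. The helpers below are hypothetical and only assemble strings in the syntax listed above; they do not call the API, and rejecting descending ranges is a choice of this sketch rather than a documented API rule.

```php
<?php
/** Pages $from through $to as one selection, e.g. pageRange(5, 8) returns "5-8". */
function pageRange(int $from, int $to): string
{
    // This sketch only builds ascending ranges
    if ($from < 1 || $to < $from) {
        throw new InvalidArgumentException("Invalid page range: {$from}-{$to}");
    }
    return sprintf('%d-%d', $from, $to);
}

/** Page $from through the last page, e.g. fromPage(5) returns "5-z". */
function fromPage(int $from): string
{
    if ($from < 1) {
        throw new InvalidArgumentException("Invalid start page: {$from}");
    }
    return sprintf('%d-z', $from);
}

/** The last $count pages, e.g. lastPages(3) returns "r3-r1". */
function lastPages(int $count): string
{
    if ($count < 1) {
        throw new InvalidArgumentException("Invalid page count: {$count}");
    }
    return sprintf('r%d-r1', $count);
}

/** Joins selections with commas so that each one becomes its own PDF. */
function separateFiles(string ...$selections): string
{
    return implode(',', $selections);
}

echo separateFiles(pageRange(1, 3), pageRange(5, 8), fromPage(10)), "\n"; // 1-3,5-8,10-z
echo lastPages(3), "\n"; // r3-r1
```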
Batch Processing Multiple Statements
Process multiple bank statements and extract invoices from each:
<?php
/**
 * Process multiple statements and extract invoice sections
 */
function batchExtractInvoices($statements, $invoicePages = '5-8') {
    global $apiToken, $apiUrl;
    $results = [];
    foreach ($statements as $statement) {
        echo "Processing: {$statement['name']}... ";
        try {
            $split = splitPdf(
                $statement['url'],
                $invoicePages,
                $apiToken,
                $apiUrl
            );
            if (empty($split[0])) {
                // A single page range should yield exactly one file
                throw new Exception('No file returned');
            }
            $results[] = [
                'statement' => $statement['name'],
                'invoice_url' => $split[0]['file'],
                'pages' => $split[0]['pages'],
                'size' => $split[0]['size']
            ];
            echo "OK ({$split[0]['pages']} pages)\n";
        } catch (Exception $e) {
            echo "FAILED: {$e->getMessage()}\n";
            $results[] = [
                'statement' => $statement['name'],
                'error' => $e->getMessage()
            ];
        }
        // Pause for 0.5 seconds between requests as simple rate limiting
        usleep(500000);
    }
    return $results;
}
// Monthly statements to process
$statements = [
    ['name' => 'January 2024', 'url' => 'https://storage.example.com/jan-2024.pdf'],
    ['name' => 'February 2024', 'url' => 'https://storage.example.com/feb-2024.pdf'],
    ['name' => 'March 2024', 'url' => 'https://storage.example.com/mar-2024.pdf'],
];
$invoices = batchExtractInvoices($statements);
echo "\n=== Extracted Invoices ===\n";
foreach ($invoices as $inv) {
    if (isset($inv['invoice_url'])) {
        echo "{$inv['statement']}: {$inv['invoice_url']}\n";
    } else {
        echo "{$inv['statement']}: Error - {$inv['error']}\n";
    }
}
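A fixed pause spaces out requests but does not recover from a transient failure such as a timeout. One option is to wrap each call in a retry loop with increasing delays. The sketch below reflects an assumption about what suits a batch job, not documented aPDF.io guidance, and withRetries is a hypothetical helper.

```php
<?php
/**
 * Calls $operation until it succeeds or $maxAttempts is reached.
 * The delay doubles after each failure, starting at $baseDelayMs.
 * The last exception is rethrown once the attempts are exhausted.
 */
function withRetries(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 500)
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }
    $attempt = 0;
    while (true) {
        $attempt++;
        try {
            return $operation();
        } catch (Exception $e) {
            if ($attempt >= $maxAttempts) {
                throw $e;
            }
            // With the defaults: 500 ms after the first failure, 1000 ms after the second
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Inside batchExtractInvoices, the splitPdf call could become:
// $split = withRetries(function () use ($statement, $invoicePages, $apiToken, $apiUrl) {
//     return splitPdf($statement['url'], $invoicePages, $apiToken, $apiUrl);
// });
```

Since splitPdf throws the same exception type for every failure, this wrapper would also retry errors that cannot succeed on a second attempt, such as an invalid token. Distinguishing those cases would need a dedicated exception class or a check of the HTTP status.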
Splitting by Document Type
When statements have a consistent structure, you can create a configuration-driven splitter:
<?php
// Configuration for different statement types
$statementFormats = [
    'chase_business' => [
        'summary' => '1-2',
        'transactions' => '3-5',
        'invoices' => '6-10',
        'disclosures' => '11-z'
    ],
    'amex_corporate' => [
        'summary' => '1',
        'transactions' => '2-4',
        'rewards' => '5',
        'invoices' => '6-z'
    ],
    'vendor_combined' => [
        'cover' => '1',
        'invoices' => '2-z'
    ]
];
function splitByFormat($pdfUrl, $formatName) {
    global $statementFormats, $apiToken, $apiUrl;
    if (!isset($statementFormats[$formatName])) {
        throw new Exception("Unknown format: $formatName");
    }
    $format = $statementFormats[$formatName];
    $pageRanges = implode(',', array_values($format));
    $results = splitPdf($pdfUrl, $pageRanges, $apiToken, $apiUrl);
    // Map results back to section names
    $sections = [];
    $i = 0;
    foreach ($format as $sectionName => $range) {
        $sections[$sectionName] = $results[$i] ?? null;
        $i++;
    }
    return $sections;
}
// Usage
$sections = splitByFormat(
    'https://storage.example.com/chase-jan-2024.pdf',
    'chase_business'
);
// Access specific sections (null if the API returned no file for that section)
$invoicesPdf = $sections['invoices']['file'] ?? null;
$transactionsPdf = $sections['transactions']['file'] ?? null;
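A typo in the configuration would otherwise surface only when the API rejects the request, so it may be worth validating the ranges when the configuration is loaded. The sketch below is a hypothetical helper that accepts only the three forms used in $statementFormats (a single page, N-M, or N-z). The API accepts more patterns than this check covers, so treat it as a starting point.

```php
<?php
/**
 * Returns the section names whose range is not a single page ("5"),
 * a closed range ("3-5"), or an open-ended range ("6-z").
 * Hypothetical helper: it checks only the forms used in $statementFormats.
 */
function invalidSections(array $format): array
{
    $invalid = [];
    foreach ($format as $section => $range) {
        // D modifier: "$" must match the very end, not before a trailing newline
        if (!is_string($range) || preg_match('/^[1-9]\d*(-([1-9]\d*|z))?$/D', $range) !== 1) {
            $invalid[] = $section;
        }
    }
    return $invalid;
}

$format = [
    'summary' => '1-2',
    'transactions' => '3-5',
    'invoices' => '6-10',
    'disclosures' => '11-z',
];
var_dump(invalidSections($format) === []); // bool(true)
```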
Next Steps
Now that you can extract pages from PDFs, consider these workflows:
- Merge extracted invoices: Use the Merge endpoint to combine all extracted invoices into a single document for the accountant.
- Extract text for automation: Use the Content Read endpoint to extract invoice data for automated bookkeeping.