Extract PDF Text for RAG Pipelines & LLMs with Python
Building an AI chatbot that answers questions about your documents? The first challenge is always the same: getting clean text out of PDFs.
Libraries like PyPDF2 often produce garbled output with broken words, missing spaces, or jumbled paragraphs. That messy text ruins your embeddings and confuses your LLM.
In this tutorial, we will use the aPDF.io API to extract clean, structured text from any PDF and feed it directly into OpenAI or a vector database like Pinecone.
The Quick Solution
import requests

response = requests.post(
    'https://apdf.io/api/pdf/content/read',
    headers={
        'Authorization': 'Bearer YOUR_API_TOKEN',
        'Accept': 'application/json',
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    data={
        'file': 'https://pdfobject.com/pdf/sample.pdf'
    }
)

result = response.json()

# Get all text from all pages
for page in result['pages']:
    print(f"Page {page['page']}: {page['content'][:100]}...")
That's it. You get clean text from every page, ready for your RAG pipeline.
Why This Matters for RAG
RAG (Retrieval-Augmented Generation) pipelines work by:
1. Extracting text from your documents
2. Splitting text into chunks (a minimal chunking sketch follows below)
3. Creating embeddings for each chunk
4. Storing embeddings in a vector database
5. Retrieving relevant chunks when users ask questions
If step 1 produces garbage, everything else fails. Clean text extraction is the foundation.
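For reference, here is what step 2 typically looks like: a minimal fixed-size chunker with character overlap. The chunk_size and overlap values are just illustrative defaults, not part of the aPDF.io API:

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by the overlap so context is shared across chunk boundaries
        start = end - overlap
    return chunks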
Step 1: Get Your API Token
- Sign up at aPDF.io (it's free).
- Copy your API Token from the dashboard.
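If you prefer not to hardcode the token in your scripts, you can read it from an environment variable instead (a small sketch; the variable name APDF_API_TOKEN is just an example):

import os

# Read the token from the environment; APDF_API_TOKEN is an illustrative name
API_TOKEN = os.environ.get("APDF_API_TOKEN")
if not API_TOKEN:
    raise RuntimeError("Set the APDF_API_TOKEN environment variable first")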
Install the requests library if you haven't already. Steps 3 and 4 below also use the openai and numpy packages:
pip install requests openai numpy
Step 2: Extract Text from a PDF
Create a file named extract_pdf.py. This script extracts text and prints metadata:
import requests

API_TOKEN = "YOUR_API_TOKEN_HERE"
API_URL = "https://apdf.io/api/pdf/content/read"

# The PDF you want to extract text from
pdf_url = "https://pdfobject.com/pdf/sample.pdf"

def extract_pdf_text(file_url):
    """Extract text content from a PDF file."""
    print(f"Extracting text from: {file_url}")
    response = requests.post(
        API_URL,
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={
            'file': file_url
        }
    )
    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return None
    return response.json()

# Extract text
result = extract_pdf_text(pdf_url)

if result:
    print(f"\nTotal pages: {result['pages_total']}")
    print(f"Total characters: {result['characters_total']}")
    print("\n--- Extracted Text ---\n")
    for page in result['pages']:
        print(f"[Page {page['page']}] ({page['characters']} chars)")
        print(page['content'])
        print("\n" + "-" * 50 + "\n")
Step 3: Feed Text to OpenAI
Now let's combine PDF extraction with OpenAI to answer questions about a document:
import requests
from openai import OpenAI

API_TOKEN = "YOUR_APDF_TOKEN"
OPENAI_API_KEY = "YOUR_OPENAI_KEY"

def extract_pdf_text(file_url):
    """Extract text from PDF using aPDF.io"""
    response = requests.post(
        'https://apdf.io/api/pdf/content/read',
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={'file': file_url}
    )
    if response.status_code != 200:
        raise Exception(f"API Error: {response.text}")
    result = response.json()
    # Combine all page content into one string
    full_text = "\n\n".join([
        page['content'] for page in result['pages']
    ])
    return full_text

def ask_about_document(pdf_url, question):
    """Extract PDF text and ask OpenAI a question about it."""
    print("Extracting PDF text...")
    document_text = extract_pdf_text(pdf_url)

    print(f"Asking: {question}")
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You answer questions based on the provided document. Be concise."
            },
            {
                "role": "user",
                "content": f"Document:\n{document_text}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

# Example usage
pdf_url = "https://pdfobject.com/pdf/sample.pdf"
answer = ask_about_document(pdf_url, "What is this document about?")
print(f"\nAnswer: {answer}")
Step 4: Build a Simple RAG Pipeline
For production use cases with larger documents, you will want to chunk the text and store embeddings. Here is a minimal RAG pipeline using OpenAI embeddings:
import requests
from openai import OpenAI
import numpy as np

API_TOKEN = "YOUR_APDF_TOKEN"
OPENAI_API_KEY = "YOUR_OPENAI_KEY"

client = OpenAI(api_key=OPENAI_API_KEY)

def extract_pdf_text(file_url):
    """Extract text from PDF."""
    response = requests.post(
        'https://apdf.io/api/pdf/content/read',
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={'file': file_url}
    )
    result = response.json()
    return [page['content'] for page in result['pages']]

def get_embedding(text):
    """Get OpenAI embedding for text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rag_query(pages, embeddings, question):
    """Find most relevant page and answer question."""
    question_embedding = get_embedding(question)

    # Find the most similar page
    similarities = [
        cosine_similarity(question_embedding, emb)
        for emb in embeddings
    ]
    best_page_idx = np.argmax(similarities)
    context = pages[best_page_idx]

    # Ask GPT with the relevant context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on the context provided."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

# Build the RAG pipeline
print("Extracting PDF...")
pages = extract_pdf_text("https://pdfobject.com/pdf/sample.pdf")

print("Creating embeddings...")
embeddings = [get_embedding(page) for page in pages]

print("Ready for questions!\n")

# Query the document
answer = rag_query(pages, embeddings, "What is this document about?")
print(f"Answer: {answer}")
For larger documents, you would store these embeddings in a vector database like Pinecone, Weaviate, or ChromaDB instead of keeping them in memory.
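As a rough sketch of what that might look like with ChromaDB, reusing the extract_pdf_text and get_embedding helpers from above (this uses an in-memory client; a persistent or hosted setup works the same way):

import chromadb

# In-memory client; use chromadb.PersistentClient(path="./chroma") to keep data on disk
chroma = chromadb.Client()
collection = chroma.create_collection(name="pdf_pages")

# Store one embedded document per PDF page
pages = extract_pdf_text("https://pdfobject.com/pdf/sample.pdf")
collection.add(
    ids=[f"page-{i + 1}" for i in range(len(pages))],
    documents=pages,
    embeddings=[get_embedding(page) for page in pages],
)

# Retrieve the most relevant page for a question
results = collection.query(
    query_embeddings=[get_embedding("What is this document about?")],
    n_results=1,
)
context = results["documents"][0][0]

From there, the retrieved context goes into the same chat completion call as in rag_query above.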
API Response Structure
The /pdf/content/read endpoint returns structured JSON with page-level text:
{
    "pages_total": 2,
    "characters_total": 3847,
    "pages": [
        {
            "page": 1,
            "characters": 2103,
            "content": "This is the text from page 1..."
        },
        {
            "page": 2,
            "characters": 1744,
            "content": "This is the text from page 2..."
        }
    ]
}
The page-level structure is useful for chunking strategies where you want to keep page boundaries intact.
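For example, you can keep one chunk per page and carry the page number along as metadata so answers can cite their source (a minimal sketch, assuming the result dictionary from the response above):

# One chunk per page, with the page number preserved as metadata
chunks = [
    {"page": page["page"], "text": page["content"]}
    for page in result["pages"]
]

# Each chunk can now be embedded and stored with its page number,
# so retrieved context can be cited as "page N" in the final answer.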
Conclusion
Clean text extraction is the foundation of every RAG pipeline. With the aPDF.io API, you can extract text from any PDF in seconds and feed it directly into OpenAI, embedding models, or vector databases.
No more fighting with PyPDF2 or pdfminer. Just a simple API call and clean text.
Next Steps
- Search Within PDFs: Use the Search endpoint to find specific text patterns across your documents.
- Extract Specific Pages: Use the Page Extract endpoint to pull out only the pages you need before processing.