Extract PDF Text for RAG Pipelines & LLMs with Python
Building an AI chatbot that answers questions about your documents? The first challenge is always the same: getting clean text out of PDFs.
Libraries like PyPDF2 often produce garbled output with broken words, missing spaces, or jumbled paragraphs. That messy text ruins your embeddings and confuses your LLM.
In this tutorial, we will use the aPDF.io API to extract clean, structured text from any PDF and feed it directly into OpenAI or a vector database like Pinecone.
The Quick Solution
import requests

response = requests.post(
    'https://apdf.io/api/pdf/content/read',
    headers={
        'Authorization': 'Bearer YOUR_API_TOKEN',
        'Accept': 'application/json',
        'Content-Type': 'application/x-www-form-urlencoded'
    },
    data={
        'file': 'https://pdfobject.com/pdf/sample.pdf'
    }
)

result = response.json()

# Get all text from all pages
for page in result['pages']:
    print(f"Page {page['page']}: {page['content'][:100]}...")
That's it. You get clean text from every page, ready for your RAG pipeline.
Why This Matters for RAG
RAG (Retrieval-Augmented Generation) pipelines work by:
1. Extracting text from your documents
2. Splitting text into chunks (a minimal chunking sketch follows below)
3. Creating embeddings for each chunk
4. Storing embeddings in a vector database
5. Retrieving relevant chunks when users ask questions
If step 1 produces garbage, everything else fails. Clean text extraction is the foundation.
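For reference, here is what step 2 typically looks like: a minimal fixed-size chunker with character overlap. The chunk_size and overlap values are just illustrative defaults, not part of the aPDF.io API:

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks of roughly chunk_size characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        # Step back by the overlap so context is shared across chunk boundaries
        start = end - overlap
    return chunks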
Step 1: Get Your API Token
- Sign up at aPDF.io (it's free).
- Copy your API Token from the dashboard.
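If you prefer not to hardcode the token in your scripts, you can read it from an environment variable instead (a small sketch; the variable name APDF_API_TOKEN is just an example):

import os

# Read the token from the environment; APDF_API_TOKEN is an illustrative name
API_TOKEN = os.environ.get("APDF_API_TOKEN")
if not API_TOKEN:
    raise RuntimeError("Set the APDF_API_TOKEN environment variable first")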
Install the requests library if you haven't already. Steps 3 and 4 below also use the openai and numpy packages:
pip install requests openai numpy
Step 2: Extract Text from a PDF
Create a file named extract_pdf.py. This script extracts text and prints metadata:
import requests

API_TOKEN = "YOUR_API_TOKEN_HERE"
API_URL = "https://apdf.io/api/pdf/content/read"

# The PDF you want to extract text from
pdf_url = "https://pdfobject.com/pdf/sample.pdf"

def extract_pdf_text(file_url):
    """Extract text content from a PDF file."""
    print(f"Extracting text from: {file_url}")
    response = requests.post(
        API_URL,
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={
            'file': file_url
        }
    )
    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return None
    return response.json()

# Extract text
result = extract_pdf_text(pdf_url)

if result:
    print(f"\nTotal pages: {result['pages_total']}")
    print(f"Total characters: {result['characters_total']}")
    print("\n--- Extracted Text ---\n")
    for page in result['pages']:
        print(f"[Page {page['page']}] ({page['characters']} chars)")
        print(page['content'])
        print("\n" + "-" * 50 + "\n")
Step 3: Feed Text to OpenAI
Now let's combine PDF extraction with OpenAI to answer questions about a document:
import requests
from openai import OpenAI

API_TOKEN = "YOUR_APDF_TOKEN"
OPENAI_API_KEY = "YOUR_OPENAI_KEY"

def extract_pdf_text(file_url):
    """Extract text from PDF using aPDF.io"""
    response = requests.post(
        'https://apdf.io/api/pdf/content/read',
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={'file': file_url}
    )
    if response.status_code != 200:
        raise Exception(f"API Error: {response.text}")
    result = response.json()
    # Combine all page content into one string
    full_text = "\n\n".join([
        page['content'] for page in result['pages']
    ])
    return full_text

def ask_about_document(pdf_url, question):
    """Extract PDF text and ask OpenAI a question about it."""
    print("Extracting PDF text...")
    document_text = extract_pdf_text(pdf_url)

    print(f"Asking: {question}")
    client = OpenAI(api_key=OPENAI_API_KEY)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You answer questions based on the provided document. Be concise."
            },
            {
                "role": "user",
                "content": f"Document:\n{document_text}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

# Example usage
pdf_url = "https://pdfobject.com/pdf/sample.pdf"
answer = ask_about_document(pdf_url, "What is this document about?")
print(f"\nAnswer: {answer}")
Step 4: Build a Simple RAG Pipeline
For production use cases with larger documents, you will want to chunk the text and store embeddings. Here is a minimal RAG pipeline using OpenAI embeddings:
import requests
from openai import OpenAI
import numpy as np

API_TOKEN = "YOUR_APDF_TOKEN"
OPENAI_API_KEY = "YOUR_OPENAI_KEY"

client = OpenAI(api_key=OPENAI_API_KEY)

def extract_pdf_text(file_url):
    """Extract text from PDF."""
    response = requests.post(
        'https://apdf.io/api/pdf/content/read',
        headers={
            'Authorization': f'Bearer {API_TOKEN}',
            'Accept': 'application/json',
            'Content-Type': 'application/x-www-form-urlencoded'
        },
        data={'file': file_url}
    )
    result = response.json()
    return [page['content'] for page in result['pages']]

def get_embedding(text):
    """Get OpenAI embedding for text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    """Calculate cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rag_query(pages, embeddings, question):
    """Find most relevant page and answer question."""
    question_embedding = get_embedding(question)

    # Find the most similar page
    similarities = [
        cosine_similarity(question_embedding, emb)
        for emb in embeddings
    ]
    best_page_idx = np.argmax(similarities)
    context = pages[best_page_idx]

    # Ask GPT with the relevant context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer based on the context provided."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

# Build the RAG pipeline
print("Extracting PDF...")
pages = extract_pdf_text("https://pdfobject.com/pdf/sample.pdf")

print("Creating embeddings...")
embeddings = [get_embedding(page) for page in pages]

print("Ready for questions!\n")

# Query the document
answer = rag_query(pages, embeddings, "What is this document about?")
print(f"Answer: {answer}")
For larger documents, you would store these embeddings in a vector database like Pinecone, Weaviate, or ChromaDB instead of keeping them in memory.
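As a rough sketch of what that might look like with ChromaDB, reusing the extract_pdf_text and get_embedding helpers from above (this uses an in-memory client; a persistent or hosted setup works the same way):

import chromadb

# In-memory client; use chromadb.PersistentClient(path="./chroma") to keep data on disk
chroma = chromadb.Client()
collection = chroma.create_collection(name="pdf_pages")

# Store one embedded document per PDF page
pages = extract_pdf_text("https://pdfobject.com/pdf/sample.pdf")
collection.add(
    ids=[f"page-{i + 1}" for i in range(len(pages))],
    documents=pages,
    embeddings=[get_embedding(page) for page in pages],
)

# Retrieve the most relevant page for a question
results = collection.query(
    query_embeddings=[get_embedding("What is this document about?")],
    n_results=1,
)
context = results["documents"][0][0]

From there, the retrieved context goes into the same chat completion call as in rag_query above.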
API Response Structure
The /pdf/content/read endpoint returns structured JSON with page-level text:
{
    "pages_total": 2,
    "characters_total": 3847,
    "pages": [
        {
            "page": 1,
            "characters": 2103,
            "content": "This is the text from page 1..."
        },
        {
            "page": 2,
            "characters": 1744,
            "content": "This is the text from page 2..."
        }
    ]
}
The page-level structure is useful for chunking strategies where you want to keep page boundaries intact.
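For example, you can keep one chunk per page and carry the page number along as metadata so answers can cite their source (a minimal sketch, assuming the result dictionary from the response above):

# One chunk per page, with the page number preserved as metadata
chunks = [
    {"page": page["page"], "text": page["content"]}
    for page in result["pages"]
]

# Each chunk can now be embedded and stored with its page number,
# so retrieved context can be cited as "page N" in the final answer.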
Conclusion
Clean text extraction is the foundation of every RAG pipeline. With the aPDF.io API, you can extract text from any PDF in seconds and feed it directly into OpenAI, embedding models, or vector databases.
No more fighting with PyPDF2 or pdfminer. Just a simple API call and clean text.
Next Steps
- Search Within PDFs: Use the Search endpoint to find specific text patterns across your documents.
- Extract Specific Pages: Use the Page Extract endpoint to pull out only the pages you need before processing.