Extract PDF Metadata for Document Management with Java

When building a document management system, you need more than just the file itself. You need metadata: page count, creation date, author, title, whether it's encrypted. This information powers search, sorting, filtering, and compliance features.

Reading PDF metadata in Java usually means adding a library such as Apache PDFBox or iText. These libraries work, but they add complexity to your build, increase memory usage, and need careful version management.

A lighter approach: extract metadata via API. Send the PDF URL and get back structured JSON with everything you need. No PDF library to bundle and no PDF parsing on your side, just data.

Quick Example

Here's how to read PDF metadata with Java:

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;

public class PDFMetadata {
    public static void main(String[] args) throws Exception {
        // Replace with your own token; keep real tokens out of source control.
        String apiToken = "YOUR_API_TOKEN";
        String apiUrl = "https://apdf.io/api/pdf/metadata/read";

        // The request body carries the public URL of the PDF to inspect.
        String jsonBody = "{\"file\": \"https://pdfobject.com/pdf/sample.pdf\"}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(apiUrl))
            .header("Authorization", "Bearer " + apiToken)
            .header("Content-Type", "application/json")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();

        // Send synchronously and read the whole response body as a string.
        HttpResponse<String> response = client.send(request,
            HttpResponse.BodyHandlers.ofString());

        // Print the status code first so failed calls are easy to spot.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}

The response includes page count, file size, PDF version, encryption status, and more.
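
Before wiring this into a larger application, it can help to move request construction into a small helper. The following is a minimal sketch, not the API's official client: the class name MetadataRequests, the environment variable APDF_API_TOKEN, and the 30-second timeout are all assumptions made for illustration. It escapes the PDF URL before placing it in the JSON body and reads the token from the environment rather than from source code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper; the names here are illustrative, not part of the API.
final class MetadataRequests {

    // Escapes quotes, backslashes, and control characters for a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Builds the JSON body with the single "file" field used in the example above.
    static String body(String pdfUrl) {
        return "{\"file\": \"" + jsonEscape(pdfUrl) + "\"}";
    }

    // Same headers as the quick example, plus a request timeout (30 s is an arbitrary choice).
    static HttpRequest build(String apiUrl, String apiToken, String pdfUrl) {
        return HttpRequest.newBuilder()
            .uri(URI.create(apiUrl))
            .timeout(Duration.ofSeconds(30))
            .header("Authorization", "Bearer " + apiToken)
            .header("Content-Type", "application/json")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(pdfUrl)))
            .build();
    }

    public static void main(String[] args) {
        // APDF_API_TOKEN is an assumed variable name; use whatever your deployment provides.
        String token = System.getenv("APDF_API_TOKEN");
        if (token == null || token.isBlank()) {
            System.err.println("Set APDF_API_TOKEN before running.");
            return;
        }
        HttpRequest request = build("https://apdf.io/api/pdf/metadata/read", token,
            "https://pdfobject.com/pdf/sample.pdf");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The request returned by build can be passed to client.send exactly as in the quick example. Without a timeout, a stalled connection can block the calling thread indefinitely, which matters once uploads are processed in bulk.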

Understanding the Response

Here's what the metadata response looks like:

{
  "title": "Quarterly Report Q4 2024",
  "creator": "Microsoft Word",
  "producer": "Adobe PDF Library",
  "created": "Sun Dec 15 09:30:00 2024 CET",
  "modified": "Fri Dec 20 14:15:00 2024 CET",
  "pages": 24,
  "encrypted": false,
  "page_size": "595.92 x 841.92 pts (A4)",
  "file_size": 1548290,
  "pdf_version": "1.7"
}

Key fields explained:

pages: Total page count for pagination and preview features
file_size: Size in bytes, useful for storage quotas and download estimates
encrypted: Whether the PDF has password protection
created/modified: Timestamps for audit trails and sorting
page_size: Dimensions, helpful for print/display formatting
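
The created, modified, and page_size values arrive as display strings, so sorting or layout logic needs them converted first. The sketch below assumes the two string formats shown in the sample response and nothing more; the API may return other shapes, and the class name MetadataFields is hypothetical. It ignores the weekday and the time zone abbreviation, so comparisons are only meaningful between timestamps reported in the same zone.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the string formats seen in the sample response.
final class MetadataFields {

    // Matches the month, day, time, and year tokens, e.g. "Dec 15 09:30:00 2024".
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("MMM d HH:mm:ss uuuu", Locale.ENGLISH);

    // Matches "<width> x <height> pts", optionally followed by a name such as "(A4)".
    private static final Pattern PAGE_SIZE =
        Pattern.compile("([0-9.]+)\\s*x\\s*([0-9.]+)\\s*pts.*");

    // Parses e.g. "Sun Dec 15 09:30:00 2024 CET"; weekday and zone are skipped.
    static LocalDateTime parseTimestamp(String raw) {
        String[] t = raw.trim().split("\\s+");
        if (t.length < 5) {
            throw new IllegalArgumentException("Unexpected timestamp: " + raw);
        }
        return LocalDateTime.parse(String.join(" ", t[1], t[2], t[3], t[4]), STAMP);
    }

    // Returns {width, height} in millimetres; one PDF point is 1/72 inch.
    static double[] pageSizeMm(String raw) {
        Matcher m = PAGE_SIZE.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected page size: " + raw);
        }
        double ptToMm = 25.4 / 72.0;
        return new double[] {
            Double.parseDouble(m.group(1)) * ptToMm,
            Double.parseDouble(m.group(2)) * ptToMm
        };
    }

    public static void main(String[] args) {
        LocalDateTime created = parseTimestamp("Sun Dec 15 09:30:00 2024 CET");
        LocalDateTime modified = parseTimestamp("Fri Dec 20 14:15:00 2024 CET");
        System.out.println("Modified after creation: " + modified.isAfter(created));

        double[] mm = pageSizeMm("595.92 x 841.92 pts (A4)");
        System.out.printf(Locale.ROOT, "Page: %.1f x %.1f mm%n", mm[0], mm[1]);
    }
}
```

Storing the parsed LocalDateTime rather than the raw string lets the database sort chronologically; the raw strings sort by weekday name, which is rarely what a catalog wants.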

Real-World Scenario: Building a Document Catalog

You're building a legal document repository. When users upload PDFs, you need to extract metadata to populate the catalog and enable searching by author, date, or page count.

import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.URI;
import com.google.gson.Gson;
import com.google.gson.JsonObject;

public class DocumentCatalog {

    private static final String API_TOKEN = "YOUR_API_TOKEN";
    private static final String API_URL = "https://apdf.io/api/pdf/metadata/read";
    private static final HttpClient client = HttpClient.newHttpClient();
    private static final Gson gson = new Gson();

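    public static JsonObject extractMetadata(String pdfUrl) throws Exception {
        // Build the body with Gson so quotes or backslashes in the URL are escaped.
        JsonObject body = new JsonObject();
        body.addProperty("file", pdfUrl);
        String jsonBody = gson.toJson(body);

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(API_URL))
            .header("Authorization", "Bearer " + API_TOKEN)
            .header("Content-Type", "application/json")
            .header("Accept", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();

        HttpResponse<String> response = client.send(request,
            HttpResponse.BodyHandlers.ofString());

        // Fail loudly on non-2xx responses instead of parsing an error payload as metadata.
        if (response.statusCode() / 100 != 2) {
            throw new IllegalStateException("Metadata request failed with HTTP "
                + response.statusCode() + ": " + response.body());
        }

        return gson.fromJson(response.body(), JsonObject.class);
    }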

    public static void catalogDocument(String documentId, String pdfUrl) {
        try {
            System.out.println("Cataloging document: " + documentId);
            JsonObject metadata = extractMetadata(pdfUrl);

            // Extract key fields for your database.
            // Optional fields fall back to a default when the key is missing or null.
            String title = metadata.has("title") && !metadata.get("title").isJsonNull()
                ? metadata.get("title").getAsString() : "Untitled";
            int pages = metadata.get("pages").getAsInt();
            long fileSize = metadata.get("file_size").getAsLong();
            boolean encrypted = metadata.get("encrypted").getAsBoolean();
            String created = metadata.has("created") && !metadata.get("created").isJsonNull()
                ? metadata.get("created").getAsString() : null;

            // In production: save to database
            System.out.println("  Title: " + title);
            System.out.println("  Pages: " + pages);
            System.out.println("  Size: " + (fileSize / 1024) + " KB");
            System.out.println("  Encrypted: " + encrypted);
            System.out.println("  Created: " + created);

            if (encrypted) {
                System.out.println("  WARNING: Document is password-protected");
            }

        } catch (Exception e) {
            System.err.println("Failed to catalog " + documentId + ": " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Process uploaded documents
        String[][] documents = {
            {"DOC-001", "https://pdfobject.com/pdf/sample.pdf"},
            {"DOC-002", "https://ontheline.trincoll.edu/images/bookdown/sample-local-pdf.pdf"}
        };

        for (String[] doc : documents) {
            catalogDocument(doc[0], doc[1]);
            System.out.println();
        }
    }
}
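
Once the fields are extracted, the catalog needs a way to query them. The sketch below uses an in-memory list as a stand-in for a database table; the Entry record and the CatalogSearch class are hypothetical names, and records require Java 16 or newer. It shows filtering by page count and by title, two of the lookups the scenario calls for.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical in-memory catalog; a real system would query a database instead.
final class CatalogSearch {

    // One row per cataloged document, holding the fields printed by catalogDocument.
    record Entry(String id, String title, int pages, long fileSize, boolean encrypted) {}

    // Documents with at least minPages pages, longest first.
    static List<Entry> withAtLeastPages(List<Entry> entries, int minPages) {
        return entries.stream()
            .filter(e -> e.pages() >= minPages)
            .sorted(Comparator.comparingInt(Entry::pages).reversed())
            .collect(Collectors.toList());
    }

    // Case-insensitive substring match on the title.
    static List<Entry> titleContains(List<Entry> entries, String needle) {
        String n = needle.toLowerCase(Locale.ROOT);
        return entries.stream()
            .filter(e -> e.title().toLowerCase(Locale.ROOT).contains(n))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> catalog = List.of(
            new Entry("DOC-001", "Quarterly Report Q4 2024", 24, 1548290L, false),
            new Entry("DOC-002", "Untitled", 3, 20480L, false),
            new Entry("DOC-003", "Annual Report 2023", 120, 7340032L, true));

        withAtLeastPages(catalog, 10).forEach(e -> System.out.println(e.id()));
        titleContains(catalog, "report").forEach(e -> System.out.println(e.title()));
    }
}
```

The same predicates translate directly into WHERE clauses once the entries live in a database.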

Detecting Encrypted Documents

The encrypted field is particularly useful for workflow automation. If a document is encrypted, you may need to handle it differently:

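// Belongs in DocumentCatalog, next to extractMetadata().
// The password is not used here; only the unlock step needs it, and this snippet leaves that step out.
public static void processDocument(String pdfUrl, String password) throws Exception {
    JsonObject metadata = extractMetadata(pdfUrl);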

    if (metadata.get("encrypted").getAsBoolean()) {
        System.out.println("Document is encrypted.");
        System.out.println("Use the security/remove endpoint with the password to unlock it.");
        // Redirect to unlock workflow
    } else {
        System.out.println("Document is open. Processing...");
        // Continue with normal processing
    }
}
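
The branching above can be pulled into a small pure function so the routing decision is testable without any network call. This is a sketch under assumptions: the EncryptionRouting class and its three Route values are hypothetical, and the actual unlock call is deliberately left out.

```java
// Hypothetical routing helper; it only decides what to do and calls no API.
final class EncryptionRouting {

    enum Route { PROCESS, UNLOCK, REQUEST_PASSWORD }

    // encrypted comes from the metadata response; password may be null or empty.
    static Route route(boolean encrypted, String password) {
        if (!encrypted) {
            return Route.PROCESS;
        }
        boolean hasPassword = password != null && !password.isEmpty();
        return hasPassword ? Route.UNLOCK : Route.REQUEST_PASSWORD;
    }

    public static void main(String[] args) {
        System.out.println(route(false, null));
        System.out.println(route(true, "secret"));
        System.out.println(route(true, null));
    }
}
```

Separating the decision from the side effects also gives the password parameter a clear purpose: an encrypted document without a password is routed back to the user instead of failing later in the pipeline.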

Next Steps

With metadata extraction in place, you can extend your document system:

Handle encrypted files: Use the Security Remove endpoint to unlock password-protected PDFs when you have the password.
Extract text for search: Use the Content Read endpoint to index the document text for full-text search.
