
Normalizing PDFs

This guide walks through the full normalization workflow: uploading a PDF, monitoring the pipeline stages, and retrieving the canonical output.

Overview

PDFCanon normalizes PDFs through a deterministic 10-stage pipeline (stages 0 through 9). The result of each stage is recorded in the normalization report, which you can retrieve after the job completes.

Synchronous vs asynchronous mode

By default, POST /api/normalize processes the document synchronously and returns the normalized PDF directly in the response body. For large documents or high-throughput workloads, send the Prefer: respond-async header; the API then returns a submission ID that you poll for completion.

Synchronous (default)

curl -X POST https://api.pdfcanon.com/api/normalize \
-H "X-Api-Key: pdfn_your_api_key_here" \
-H "Content-Type: application/pdf" \
--data-binary @input.pdf \
-o normalized.pdf

Asynchronous

curl -X POST https://api.pdfcanon.com/api/normalize \
-H "X-Api-Key: pdfn_your_api_key_here" \
-H "Content-Type: application/pdf" \
-H "Prefer: respond-async" \
--data-binary @input.pdf
# Returns: {"submissionId": "sub_...", "status": "processing"}

Then poll for completion:

# Replace {submissionId} with the submissionId value returned by the request above
curl https://api.pdfcanon.com/api/submissions/{submissionId} \
-H "X-Api-Key: pdfn_your_api_key_here"
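
The polling step can be wrapped in a small loop with backoff. The following is a minimal Python sketch, not an official client. It assumes that the submission endpoint returns JSON with a "status" field and that any value other than "processing" means the job has finished; the guide only shows the "processing" value, so check the API reference for the actual terminal statuses. The function names, intervals, and timeout are illustrative.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.pdfcanon.com/api"


def fetch_submission(submission_id: str, api_key: str) -> dict:
    """GET /api/submissions/{submissionId} and decode the JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/submissions/{submission_id}",
        headers={"X-Api-Key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_submission(
    submission_id: str,
    fetch: Callable[[str], dict],
    interval: float = 2.0,
    max_interval: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the status is no longer "processing".

    Assumption: every status other than "processing" is terminal.
    Raises TimeoutError once the accumulated wait reaches `timeout`.
    """
    waited = 0.0
    while True:
        submission = fetch(submission_id)
        if submission.get("status") != "processing":
            return submission
        if waited >= timeout:
            raise TimeoutError(f"{submission_id} still processing after {waited:.0f}s")
        sleep(interval)
        waited += interval
        # Double the delay after each attempt, up to max_interval.
        interval = min(interval * 2, max_interval)
```

A call might look like wait_for_submission(sub_id, fetch=lambda sid: fetch_submission(sid, api_key)). Passing the fetch function in keeps the loop testable without network access.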

Idempotency

Pass an Idempotency-Key header to safely retry requests without double-processing:

curl -X POST https://api.pdfcanon.com/api/normalize \
-H "X-Api-Key: pdfn_your_api_key_here" \
-H "Idempotency-Key: my-unique-key-12345" \
-H "Content-Type: application/pdf" \
--data-binary @input.pdf

See Idempotency for details.
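
The point of the header is that every retry of the same logical upload carries the same key. The sketch below illustrates that pattern in Python under a few assumptions: the key is a client-generated UUID (the guide does not prescribe a key format), network errors and HTTP 5xx responses are treated as retryable, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request
import uuid
from typing import Callable

NORMALIZE_URL = "https://api.pdfcanon.com/api/normalize"


def build_normalize_request(
    pdf_bytes: bytes, api_key: str, idempotency_key: str
) -> urllib.request.Request:
    """Build the POST /api/normalize request with an Idempotency-Key header."""
    return urllib.request.Request(
        NORMALIZE_URL,
        data=pdf_bytes,
        method="POST",
        headers={
            "X-Api-Key": api_key,
            "Idempotency-Key": idempotency_key,
            "Content-Type": "application/pdf",
        },
    )


def _send(request: urllib.request.Request) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


def send_with_retries(
    pdf_bytes: bytes,
    api_key: str,
    send: Callable[[urllib.request.Request], bytes] = _send,
    attempts: int = 3,
) -> bytes:
    """Retry one upload, reusing a single idempotency key for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # One key per logical upload; generating a new key per attempt would
    # defeat the purpose of the header.
    idempotency_key = str(uuid.uuid4())
    last_error: Exception = RuntimeError("no attempt was made")
    for _ in range(attempts):
        request = build_normalize_request(pdf_bytes, api_key, idempotency_key)
        try:
            return send(request)
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                raise  # assumption: client errors will not succeed on retry
            last_error = exc
        except OSError as exc:  # URLError, timeouts, connection resets
            last_error = exc
    raise last_error
```

In production code you would normally add a delay between attempts; it is omitted here to keep the key handling in focus.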

Pipeline stages

The normalization report includes the result of each stage:

Stage  Name                         Description
0      PDF/A Detection              Identifies compliance level of the input
1      Tamper Detection             Detects incremental-update injection and shadow content
2      Structural Repair            Fixes malformed xref tables and trailers
3      Digital Signature Detection  Handles existing digital signatures per policy
4      Active Content Removal       Strips JavaScript, embedded executables
5      AcroForm Handling            Flattens or preserves form fields
6      Metadata Canonicalization    Normalizes XMP and DocInfo metadata
7      Font Resource Validation     Validates and embeds font subsets
8      Final Rewrite                Linearizes and emits canonical PDF
9      Content Hash                 Computes SHA-256 of the canonical output

Output hash and deduplication

The outputHash in the response is the SHA-256 hash of the canonical output PDF. Inputs that normalize to the same canonical PDF produce the same hash, so you can deduplicate documents by comparing outputHash values.
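
Because the hash is computed over the canonical output, you can recompute it locally from the normalized bytes and group documents by it. The Python sketch below assumes outputHash is encoded as a lowercase hex string, which the guide does not state; the function names are illustrative.

```python
import hashlib
from typing import Dict, List


def sha256_hex(pdf_bytes: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as lowercase hex."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def matches_output_hash(normalized_pdf: bytes, output_hash: str) -> bool:
    """Check downloaded bytes against a reported hash (hex encoding assumed)."""
    return sha256_hex(normalized_pdf) == output_hash.strip().lower()


def group_by_hash(normalized_docs: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group document names by the hash of their normalized bytes.

    Any group with more than one name contains duplicates.
    """
    groups: Dict[str, List[str]] = {}
    for name, pdf_bytes in normalized_docs.items():
        groups.setdefault(sha256_hex(pdf_bytes), []).append(name)
    return groups
```

If you store outputHash alongside each document, the same grouping works on the stored values without rehashing the files.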

Next steps