
Why normalize PDFs?

PDFs are not stable files. Two PDFs that render identically — same pages, same text, same layout — routinely produce different SHA-256 hashes. This is the core problem PDFCanon solves.

If you're storing, deduplicating, signing, hashing, or auditing PDFs, this matters.

The problem in one example

invoice-2026-04-export-A.pdf   SHA-256: a1b2c3d4e5f6…
invoice-2026-04-export-B.pdf   SHA-256: 9f8e7d6c5b4a…   ← different bytes, same document

Both files came from the same source, render identically, and contain the same logical content. But because their bytes differ, every hash-based system downstream — deduplication, integrity checks, content-addressed storage, audit logs — sees them as two different documents.
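The divergence is easy to reproduce with nothing more than a hashing library. A minimal sketch (the byte strings are illustrative stand-ins for real PDF files, not valid PDF syntax):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

# Two exports of the "same" invoice: identical page content,
# but export B carries a fresh /ModDate written at save time.
export_a = b"%PDF-1.7 ...page content... /ModDate (D:20260401120000Z)"
export_b = b"%PDF-1.7 ...page content... /ModDate (D:20260401120105Z)"

print(sha256_hex(export_a))
print(sha256_hex(export_b))
# A handful of changed bytes yields a completely different digest.
```

Every downstream hash comparison sees two unrelated blobs, even though a human (or a renderer) sees one document.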

After running both through PDFCanon:

canonical (from A)   SHA-256: f9c3a18b2d…
canonical (from B)   SHA-256: f9c3a18b2d…   ← identical

Same canonical bytes, same hash, every time, in any region.

Where the drift comes from

PDFs accumulate non-semantic differences from many sources:

  • Incremental updates. PDF allows revisions to be appended to the end of a file rather than rewritten in place. Two saves of "the same" document may carry different revision histories.
  • Metadata mutations. Producer, Creator, CreationDate, ModDate, and the XMP packet change every time a tool touches the file — even if no visible content changes.
  • Object stream re-ordering. PDF objects can be serialized in any order; different libraries pick different orders.
  • Compression flags. The same content stream can be /FlateDecode-compressed at different levels, or stored uncompressed, with no visible difference.
  • Font subset names. Embedded font subsets get random six-letter prefixes (AAAAAB+Helvetica) that change per export.
  • Linearization. "Web-optimized" linearization rewrites the file structure for streaming. Toggling it changes every byte.
  • Object IDs and /ID arrays. Document IDs are commonly random per save.
  • Embedded files, JavaScript, AcroForm state. Active content carries state that mutates over time.

Any one of these flips the SHA-256.
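The first drift source, incremental updates, can be simulated directly: an incremental save appends a new revision while leaving every original byte intact, so the rendered document is unchanged but the file hash is not. A toy sketch (the markers mimic PDF structure; this is not a valid PDF):

```python
import hashlib

original = b"%PDF-1.7\n...objects...\nstartxref\n116\n%%EOF\n"

# An incremental save appends new objects, a new xref section,
# and a second %%EOF marker; the original bytes are untouched.
revision = b"5 0 obj << /ModDate (D:20260402) >> endobj\nstartxref\n180\n%%EOF\n"
updated = original + revision

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(updated).hexdigest())
# Same rendered document, different file hash.
```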

Why it matters

Use case                        What breaks without normalization
Deduplication / storage         The same document is stored N times under N different hashes.
Audit & compliance              "Has this document changed?" cannot be answered from a hash alone.
E-signing & evidence            Signed-hash verification breaks across re-saves, OS conversions, or PDF/A coercion.
Tamper detection                Real tampering is indistinguishable from benign mutation.
Active-content risk             JavaScript, embedded executables, launch actions, and AcroForm scripts persist.
Content-addressed pipelines     Idempotency keys based on file hash retrigger needlessly.

What "canonical" means in PDFCanon

PDFCanon runs every input through a deterministic 11-stage pipeline that:

  1. Removes drift sources (metadata, font name randomness, object ordering, linearization variance).
  2. Strips active content (JavaScript, embedded files, launch actions, dangerous AcroForm logic).
  3. Repairs structural defects (broken xref, post-EOF data, shadow content).
  4. Re-emits the document with stable IDs, epoch timestamps, and a fixed object order.
  5. Emits a SHA-256 over the canonical bytes — and a separate content hash over extracted text.
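Stage 5's dual output can be pictured as two independent digests: one over the exact re-emitted bytes, one over the logical text. The function names below are illustrative, not the PDFCanon API:

```python
import hashlib

def canonical_hash(canonical_bytes: bytes) -> str:
    # Digest over the exact canonical serialization of the file.
    return hashlib.sha256(canonical_bytes).hexdigest()

def content_hash(extracted_text: str) -> str:
    # Digest over the extracted text only, independent of
    # file structure, compression, or object ordering.
    return hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()

canonical_bytes = b"%PDF-1.7 ...canonical serialization..."
text = "Invoice 2026-04\nTotal: 120.00 EUR"

print(canonical_hash(canonical_bytes))
print(content_hash(text))
```

The content hash lets you answer "is the text the same?" even when comparing against a PDF that was canonicalized under a different toolchain_version.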

The output is bit-for-bit reproducible. Same input → same output → same hash, on any host, in any region, today or next month — as long as the toolchain_version is the same.
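That reproducibility claim is testable against any canonicalizer. The sketch below uses a toy normalizer that merely strips a volatile /ModDate line (purely illustrative; the real pipeline does far more), and checks the two properties that matter: two saves collapse to one hash, and canonicalizing is idempotent:

```python
import hashlib

def toy_canonicalize(pdf_bytes: bytes) -> bytes:
    # Stand-in for the real pipeline: drop lines carrying
    # volatile metadata so two saves collapse to the same bytes.
    keep = [ln for ln in pdf_bytes.splitlines()
            if not ln.startswith(b"/ModDate")]
    return b"\n".join(keep)

save_1 = b"%PDF-1.7\n...content...\n/ModDate (D:20260401)"
save_2 = b"%PDF-1.7\n...content...\n/ModDate (D:20260409)"

h1 = hashlib.sha256(toy_canonicalize(save_1)).hexdigest()
h2 = hashlib.sha256(toy_canonicalize(save_2)).hexdigest()
assert h1 == h2  # same canonical hash for both saves

# Idempotency: canonicalizing twice changes nothing.
assert toy_canonicalize(toy_canonicalize(save_1)) == toy_canonicalize(save_1)
```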

What PDFCanon is not

PDFCanon is structural-canonicalization infrastructure. It is not:

  • Antivirus or malware sandboxing
  • Content moderation
  • OCR or text extraction
  • A general-purpose PDF editor

We strip active content because it breaks determinism, not because we're scanning for threats. If you need malware analysis, run it alongside PDFCanon, not instead of it.

Try it

Drop a PDF into the Playground (no signup, 3 PDFs/day) to see the canonical hash and the deductions report. Run two different exports of the same document through it — the canonical hashes will match.
