
The normalization pipeline

Every PDF submitted to PDFCanon flows through the same deterministic pipeline. The high-level shape:

The pipeline itself is fixed and ordered: stages run one after another, never in parallel, and no stage is skipped.

Stage reference

| Stage | Name | What it does |
|-------|------|--------------|
| 0 | PDF/A detection | Identify the declared compliance level of the input. |
| 1 | Tamper detection | Detect incremental-update injection, shadow content, post-EOF data. |
| 2 | Structural repair | Fix malformed cross-reference tables and trailer dictionaries. |
| 3 | Digital signature detection | Identify and handle existing digital signatures per policy. |
| 4 | Active content removal | Strip JavaScript, embedded executables, launch actions. |
| 5 | AcroForm handling | Flatten or preserve interactive form fields. |
| 6 | Metadata canonicalization | Normalize XMP and DocInfo metadata to epoch timestamps. |
| 7 | Font resource validation | Validate fonts and detect non-embedded subsets. |
| 8 | Final rewrite | Linearize and emit a clean canonical PDF with deterministic IDs. |
| 9 | Content hash | SHA-256 over extracted text for semantic deduplication. |
| 10 | PDF/A compliance validation | Validate PDF/A compliance of the output (when input declared PDF/A). |
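
To make the fixed ordering concrete, here is a minimal Python sketch of a strictly sequential stage runner. It is an illustration only, not PDFCanon's implementation: the stage names are taken from the table above, while `Context`, `make_stage`, `PIPELINE`, and `run_pipeline` are hypothetical names, and each stage body is a placeholder that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage names copied from the stage reference; the order of this list is the
# order of execution. Everything else in this sketch is hypothetical.
STAGE_NAMES = [
    "PDF/A detection",
    "Tamper detection",
    "Structural repair",
    "Digital signature detection",
    "Active content removal",
    "AcroForm handling",
    "Metadata canonicalization",
    "Font resource validation",
    "Final rewrite",
    "Content hash",
    "PDF/A compliance validation",
]


@dataclass
class Context:
    """Hypothetical state handed from one stage to the next."""
    data: bytes
    trace: List[str] = field(default_factory=list)


Stage = Callable[[Context], Context]


def make_stage(index: int, name: str) -> Stage:
    """Build a placeholder stage that only records its own execution."""
    def run(ctx: Context) -> Context:
        ctx.trace.append(f"{index}:{name}")
        return ctx
    return run


# One stage per table row, in table order.
PIPELINE: List[Stage] = [make_stage(i, n) for i, n in enumerate(STAGE_NAMES)]


def run_pipeline(data: bytes) -> Context:
    """Run every stage once, in order, on a single thread of control."""
    ctx = Context(data=data)
    for stage in PIPELINE:
        ctx = stage(ctx)
    return ctx
```

Calling `run_pipeline(b"%PDF-1.7")` returns a context whose `trace` lists all eleven stages in table order. A plain list with a `for` loop is chosen here because it makes the "no parallelism, no skipping" property visible in the control flow itself.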

Determinism guarantees

For a given input PDF and a given toolchain_version, every stage is deterministic. The same input always produces the same canonical bytes and the same SHA-256, on any host, in any region.
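
One way to check this guarantee from the outside is to normalize the same input several times and compare SHA-256 digests of the output bytes. The sketch below uses only Python's standard `hashlib`. The `normalize` callable is a stand-in for whatever produces the canonical bytes, and `stable_normalize` and `unstable_normalize` are toy functions invented for illustration, not PDFCanon APIs.

```python
import hashlib
import itertools
from typing import Callable


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def is_deterministic(
    normalize: Callable[[bytes], bytes], data: bytes, runs: int = 3
) -> bool:
    """Return True if repeated runs over the same input give one digest."""
    digests = {sha256_hex(normalize(data)) for _ in range(runs)}
    return len(digests) == 1


def stable_normalize(data: bytes) -> bytes:
    """Toy normalizer (assumption): output depends only on the input."""
    return data.strip() + b"\n"


_counter = itertools.count()


def unstable_normalize(data: bytes) -> bytes:
    """Toy counter-example: leaks per-call state into the output."""
    return data.strip() + str(next(_counter)).encode()
```

With these toy functions, `is_deterministic(stable_normalize, b" x ")` is true and `is_deterministic(unstable_normalize, b" x ")` is false. The second case stands in for anything that leaks per-run state, such as wall-clock timestamps or random IDs, into the output.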

The pipeline is implemented in src/PDFCanon.Worker/Pipeline/Stages/ — the diagram above mirrors the actual stage order one-for-one.
