The normalization pipeline

Every PDF submitted to PDFCanon flows through the same deterministic pipeline. The pipeline is fixed and ordered: stages never run in parallel, and no stage is skipped.
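
As a rough illustration of what "fixed and ordered" means in practice, the sketch below runs a list of placeholder stages strictly in sequence. The stage names are taken from the stage reference on this page; everything else (the Document class, make_stage, run_pipeline) is hypothetical and is not PDFCanon's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical container for a document moving through the pipeline.
@dataclass
class Document:
    data: bytes
    trace: List[str] = field(default_factory=list)

# Stage names as listed in the stage reference, in order (0 through 10).
STAGE_NAMES = [
    "PDF/A detection",
    "Tamper detection",
    "Structural repair",
    "Digital signature detection",
    "Active content removal",
    "AcroForm handling",
    "Metadata canonicalization",
    "Font resource validation",
    "Final rewrite",
    "Content hash",
    "PDF/A compliance validation",
]

def make_stage(name: str) -> Callable[[Document], Document]:
    """Build a placeholder stage that only records that it ran."""
    def run(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return run

# The stage list is built once and never reordered at run time.
STAGES = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(doc: Document) -> Document:
    """Run every stage sequentially: no parallelism, no skipping."""
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

After run_pipeline returns, doc.trace holds all eleven stage names in table order, which is the property the real pipeline is described as guaranteeing.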

Stage reference

Stage	Name	What it does
0	PDF/A detection	Identify the declared compliance level of the input.
1	Tamper detection	Detect incremental-update injection, shadow content, post-EOF data.
2	Structural repair	Fix malformed cross-reference tables and trailer dictionaries.
3	Digital signature detection	Identify and handle existing digital signatures per policy.
4	Active content removal	Strip JavaScript, embedded executables, launch actions.
5	AcroForm handling	Flatten or preserve interactive form fields.
6	Metadata canonicalization	Normalize XMP and DocInfo metadata to epoch timestamps.
7	Font resource validation	Validate fonts and detect non-embedded subsets.
8	Final rewrite	Linearize and emit a clean canonical PDF with deterministic IDs.
9	Content hash	SHA-256 over extracted text for semantic deduplication.
10	PDF/A compliance validation	Validate PDF/A compliance of the output (when input declared PDF/A).
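
To make stage 9 concrete, here is a minimal sketch of a SHA-256 digest computed over extracted text. How PDFCanon extracts the text, and whether it normalizes whitespace or Unicode before hashing, is not stated here; the UTF-8 encoding and the content_hash name are assumptions made for illustration only.

```python
import hashlib

def content_hash(extracted_text: str) -> str:
    """Return the SHA-256 hex digest of the extracted text.

    Assumption: the text is hashed as UTF-8 bytes with no further
    normalization. The real stage may prepare the text differently.
    """
    return hashlib.sha256(extracted_text.encode("utf-8")).hexdigest()

# Two documents with identical extracted text get the same digest,
# even if their PDF bytes differ, which is what makes the hash
# usable for semantic deduplication.
first = content_hash("Invoice 1001\nTotal: 42.00")
second = content_hash("Invoice 1001\nTotal: 42.00")
is_duplicate = first == second
```

Because the digest depends only on the text, two PDFs that render the same words through different fonts or object layouts would collide on this hash by design.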

Determinism guarantees

For a given input PDF and a given toolchain_version, every stage is deterministic. The same input always produces the same canonical bytes and the same SHA-256, on any host, in any region.
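
One way to check this guarantee from the outside is to normalize the same input several times and compare the SHA-256 of the outputs. The sketch below assumes a hypothetical normalize callable that stands in for running the pipeline with a pinned toolchain_version; it is not part of any PDFCanon API described here.

```python
import hashlib
from typing import Callable

def sha256_hex(data: bytes) -> str:
    """SHA-256 hex digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()

def is_deterministic(
    normalize: Callable[[bytes], bytes],  # hypothetical stand-in for the pipeline
    pdf_bytes: bytes,
    runs: int = 3,
) -> bool:
    """Normalize the same input `runs` times and check all digests match."""
    digests = {sha256_hex(normalize(pdf_bytes)) for _ in range(runs)}
    return len(digests) == 1
```

Running the same check on different hosts and comparing the digests would extend it to the "any host, any region" part of the guarantee.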

The pipeline is implemented in src/PDFCanon.Worker/Pipeline/Stages/, and the stage order listed above mirrors the implementation one-for-one.
