<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://docs.pdfcanon.com/blog</id>
    <title>PDFCanon Docs Blog</title>
    <updated>2026-04-29T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://docs.pdfcanon.com/blog"/>
    <subtitle>PDFCanon Docs Blog</subtitle>
    <icon>https://docs.pdfcanon.com/img/logo.svg</icon>
    <entry>
        <title type="html"><![CDATA[Why Two Identical PDFs Have Different SHA-256 Hashes]]></title>
        <id>https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently</id>
        <link href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently"/>
        <updated>2026-04-29T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A deep dive into the seven sources of non-determinism in the PDF format, why this breaks audit trails and deduplication, and how an 11-stage normalization pipeline produces stable canonical hashes.]]></summary>
        <content type="html"><![CDATA[<p>I spent a long time figuring out why <code>sha256sum invoice.pdf</code> returns a different hash every time my accounting software re-exports "the same" document.</p>
<p>Turns out this is a fundamental property of the PDF format - and it quietly breaks a number of real-world systems that depend on stable file hashes: deduplication, audit trails, content-addressed storage, integrity verification.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-problem-in-30-seconds">The problem in 30 seconds<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#the-problem-in-30-seconds" class="hash-link" aria-label="Direct link to The problem in 30 seconds" title="Direct link to The problem in 30 seconds" translate="no">​</a></h2>
<p>Open any PDF in a hex editor after a round-trip through Adobe Acrobat, Preview.app, or even just re-saving from the same tool. The bytes change. The hash changes. The visual content is identical.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Save invoice.pdf from Word today</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">$ sha256sum invoice.pdf</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">a742f</span><span class="token punctuation" style="color:#393A34">..</span><span class="token plain">.e91d  invoice.pdf</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Re-export the exact same document tomorrow</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">$ sha256sum invoice.pdf</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">8c20b1</span><span class="token punctuation" style="color:#393A34">..</span><span class="token plain">.f4a3  invoice.pdf</span><br></span></code></pre></div></div>
<p>These aren't different documents. They render pixel-for-pixel identically. But every system that works at the byte level (deduplication, audit trails, content-addressed storage, integrity verification) sees two completely different files.</p>
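<p>To make the effect concrete, here is a minimal Python sketch of what <code>sha256sum</code> does. The two temporary files are made-up stand-ins for the two exports (not real PDFs) and the helper name <code>sha256_file</code> is only illustrative.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks, the same way sha256sum streams it."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "export_today.pdf"
    second = Path(tmp) / "export_tomorrow.pdf"
    # Stand-in bytes: identical "content", different save timestamp.
    first.write_bytes(b"%PDF-1.7\n(same content) /ModDate (D:20260408093000)\n%%EOF\n")
    second.write_bytes(b"%PDF-1.7\n(same content) /ModDate (D:20260409141522)\n%%EOF\n")
    hash_first = sha256_file(first)
    hash_second = sha256_file(second)

print(hash_first)
print(hash_second)
print("identical:", hash_first == hash_second)
```

<p>The two digests share nothing recognizable even though only the timestamp differs, which is exactly what a cryptographic hash is designed to do.</p>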
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="whats-actually-different">What's actually different?<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#whats-actually-different" class="hash-link" aria-label="Direct link to What's actually different?" title="Direct link to What's actually different?" translate="no">​</a></h2>
<p>I dug into the ISO 32000-2 spec and PDFs have at least <strong>seven sources of non-determinism</strong>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="1-timestamps">1. Timestamps<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#1-timestamps" class="hash-link" aria-label="Direct link to 1. Timestamps" title="Direct link to 1. Timestamps" translate="no">​</a></h3>
<p><code>/CreationDate</code> and <code>/ModDate</code> in the document info dictionary change on every save. So does the XMP metadata packet, an embedded XML blob that mirrors these dates plus tool-specific metadata.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token comment" style="color:#999988;font-style:italic"># Today's save</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/CreationDate </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">D:20260408093000-04</span><span class="token string" style="color:#e3116c">'00'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/ModDate </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">D:20260408093000-04</span><span class="token string" style="color:#e3116c">'00'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token comment" style="color:#999988;font-style:italic"># Tomorrow's save - same document</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/CreationDate </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">D:20260409141522-04</span><span class="token string" style="color:#e3116c">'00'</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">/ModDate </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">D:20260409141522-04</span><span class="token string" style="color:#e3116c">'00'</span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<p>In the example above, only the two date strings changed (seven digits in each), yet the hash is completely different.</p>
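<p>As a rough illustration of why pinning the dates helps, the sketch below rewrites both date entries to a fixed value before hashing. It works on a toy byte fragment only. Doing this on a real PDF would shift byte offsets and needs a proper PDF library. The regex, the epoch-zero string, and the function name are my own assumptions.</p>

```python
import hashlib
import re

# Matches "/CreationDate (D:...)" and "/ModDate (D:...)" in a raw fragment.
DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\(D:[^)]*\)")
# Assumed canonical value: the Unix epoch written as a PDF date string.
EPOCH_ZERO = rb"/\1 (D:19700101000000Z)"


def canonicalize_dates(fragment: bytes) -> bytes:
    """Replace every date value in the fragment with the fixed epoch value."""
    return DATE_ENTRY.sub(EPOCH_ZERO, fragment)


today = b"/CreationDate (D:20260408093000-04'00')\n/ModDate (D:20260408093000-04'00')\n"
tomorrow = b"/CreationDate (D:20260409141522-04'00')\n/ModDate (D:20260409141522-04'00')\n"

raw_match = hashlib.sha256(today).digest() == hashlib.sha256(tomorrow).digest()
canonical_match = (
    hashlib.sha256(canonicalize_dates(today)).digest()
    == hashlib.sha256(canonicalize_dates(tomorrow)).digest()
)
print("raw hashes match:", raw_match)
print("canonical hashes match:", canonical_match)
```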
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="2-producer-strings">2. Producer strings<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#2-producer-strings" class="hash-link" aria-label="Direct link to 2. Producer strings" title="Direct link to 2. Producer strings" translate="no">​</a></h3>
<p>Every tool (understandably) writes its own signature into the <code>/Producer</code> and <code>/Creator</code> fields:</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">/Producer </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Microsoft® Word </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> Microsoft </span><span class="token number" style="color:#36acaa">365</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/Producer </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">macOS Version </span><span class="token number" style="color:#36acaa">14.2</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">\</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Build 23C64</span><span class="token punctuation" style="color:#393A34">\</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> Quartz PDFContext</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/Producer </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">LibreOffice </span><span class="token number" style="color:#36acaa">7.6</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">/Producer </span><span 
class="token punctuation" style="color:#393A34">(</span><span class="token plain">iLovePDF</span><span class="token punctuation" style="color:#393A34">)</span><br></span></code></pre></div></div>
<p>Print a Word document to PDF on two different machines and both produce the same visual output, but with different producer strings and thus different hashes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="3-incremental-updates">3. Incremental updates<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#3-incremental-updates" class="hash-link" aria-label="Direct link to 3. Incremental updates" title="Direct link to 3. Incremental updates" translate="no">​</a></h3>
<p>This is the big one.</p>
<p>PDFs support "incremental saves" where each edit <strong>appends</strong> new objects to the end of the file rather than rewriting it. The previous revision doesn't get deleted.  It's still there in the bytes, invisible to the viewer but present in the file.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">┌──────────────────────────┐</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   Original document      │  ← revision </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   %%EOF                  │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">├──────────────────────────┤</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   Edited page </span><span class="token number" style="color:#36acaa">3</span><span class="token plain">          │  ← revision </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">appended</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   %%EOF                  │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">├──────────────────────────┤</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   Changed a font         │  ← revision </span><span class="token number" style="color:#36acaa">3</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">appended</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   %%EOF                  │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">├──────────────────────────┤</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   </span><span class="token string" style="color:#e3116c">"Final"</span><span class="token plain"> save           │  ← revision </span><span class="token number" style="color:#36acaa">4</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">appended</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">│   %%EOF                  │</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">└──────────────────────────┘</span><br></span></code></pre></div></div>
<p>Your 200 KB contract might carry 1.4 MB of invisible revision history: previous drafts, deleted pages, redacted content. Each revision adds another <code>%%EOF</code> marker.</p>
<p><strong>I've seen production documents with 47 of them.</strong></p>
<p>The kicker: two people can start with the same base document, make the same single edit, and produce files with completely different byte layouts depending on how many prior incremental saves exist in their copy.</p>
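<p>A quick way to get a feel for this is to count the markers yourself. The sketch below is a heuristic, not a parser: <code>%%EOF</code> can also appear inside stream data, so treat the count as an upper bound. The function names and the sample bytes are hypothetical.</p>

```python
EOF_MARKER = b"%%EOF"


def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF occurrences; roughly one per saved revision."""
    return pdf_bytes.count(EOF_MARKER)


def bytes_after_last_eof(pdf_bytes: bytes) -> int:
    """Return how many non-whitespace bytes trail the final %%EOF."""
    last = pdf_bytes.rfind(EOF_MARKER)
    if last == -1:
        return 0  # No marker at all; nothing to measure against.
    trailing = pdf_bytes[last + len(EOF_MARKER):]
    return len(trailing.strip())


# Toy file with an original revision plus one appended revision.
sample = (
    b"%PDF-1.7\n1 0 obj\nendobj\n%%EOF\n"
    b"3 0 obj\nendobj\n%%EOF\n"
)

revision_count = count_eof_markers(sample)
trailing_bytes = bytes_after_last_eof(sample)
print(revision_count, trailing_bytes)
```

<p>To try it on a real file, pass in <code>Path("contract.pdf").read_bytes()</code>. A count above one is a hint that older revisions are still sitting in the file.</p>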
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="4-object-ordering">4. Object ordering<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#4-object-ordering" class="hash-link" aria-label="Direct link to 4. Object ordering" title="Direct link to 4. Object ordering" translate="no">​</a></h3>
<p>A PDF is a collection of numbered objects. The spec doesn't require them to be serialized in any particular order. Object 1 can come before or after Object 47. Different tools, and even the same tool across versions, arrange them differently.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="5-cross-reference-tables">5. Cross-reference tables<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#5-cross-reference-tables" class="hash-link" aria-label="Direct link to 5. Cross-reference tables" title="Direct link to 5. Cross-reference tables" translate="no">​</a></h3>
<p>The xref structure maps object numbers to byte offsets. It can be a plain text table or a compressed stream. Entries can be in any order. Any time the objects move (see #4), the offsets change.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="6-object-streams">6. Object streams<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#6-object-streams" class="hash-link" aria-label="Direct link to 6. Object streams" title="Direct link to 6. Object streams" translate="no">​</a></h3>
<p>Objects can be packed into compressed <code>ObjStm</code> containers or written as individual top-level objects. This is a space optimization, but different tools make different choices, and each choice changes the bytes.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="7-the-id-array">7. The /ID array<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#7-the-id-array" class="hash-link" aria-label="Direct link to 7. The /ID array" title="Direct link to 7. The /ID array" translate="no">​</a></h3>
<p>A pair of hex strings meant to uniquely identify the document. The first is supposed to stay fixed for the life of the document while the second changes when the file is updated, but many tools don't follow that.</p>
<p>In practice, about half the PDF tools I tested regenerate both strings on every save.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-stuff-that-shouldnt-be-there-at-all">The stuff that shouldn't be there at all<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#the-stuff-that-shouldnt-be-there-at-all" class="hash-link" aria-label="Direct link to The stuff that shouldn't be there at all" title="Direct link to The stuff that shouldn't be there at all" translate="no">​</a></h2>
<p>Beyond non-determinism, PDFs can contain things that have no business being in a document you're storing:</p>
<ul>
<li class=""><strong>Embedded JavaScript</strong> - <code>/JavaScript</code> dictionaries that execute code, often with access to the filesystem and network.</li>
<li class=""><strong>Actions</strong> - <code>/AA</code> (additional actions) that run on events like page open, page close, document open, document close, etc.</li>
<li class=""><strong>Launch actions</strong> - <code>/Launch</code> entries that can open external applications, ugh why?</li>
<li class=""><strong>Open actions</strong> - <code>/OpenAction</code> that runs when the document is opened, seriously?</li>
<li class=""><strong>File attachments</strong> - hidden in the <code>/EmbeddedFiles</code> name tree</li>
<li class=""><strong>Rich media annotations</strong> - embedded Flash, video, 3D models</li>
<li class=""><strong>Interactive form fields</strong> - AcroForm state that changes on user interaction</li>
</ul>
<p>These don't affect visual rendering, but they change the hash. And they're a real security risk in every PDF your users upload.</p>
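<p>For a first triage pass you can grep the raw bytes for the relevant name objects. This is only a sketch under loud assumptions: it misses anything stored inside compressed object streams or written with <code>#xx</code> name escapes, the lookahead is a crude approximation of PDF name delimiters, and <code>/RichMedia</code> and <code>/AcroForm</code> are the names I chose to stand in for the last two bullets. The sample bytes are a toy fragment, not a valid PDF.</p>

```python
import re

# Name objects corresponding to the bullets above (last two are assumed).
ACTIVE_NAMES = [
    b"JavaScript", b"AA", b"OpenAction", b"Launch",
    b"EmbeddedFiles", b"RichMedia", b"AcroForm",
]


def find_active_names(pdf_bytes: bytes) -> dict[str, int]:
    """Return a count for each suspicious name found in the raw bytes."""
    counts: dict[str, int] = {}
    for name in ACTIVE_NAMES:
        # The lookahead stops /AA from matching longer names such as /AAPL.
        pattern = re.compile(rb"/" + re.escape(name) + rb"(?![A-Za-z0-9])")
        hits = len(pattern.findall(pdf_bytes))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts


fragment = (
    b"/Type /Catalog /OpenAction 5 0 R /AAPL:Keywords (x)\n"
    b"/S /JavaScript /JS (app.alert\\(1\\))\n"
)
findings = find_active_names(fragment)
print(findings)
```

<p>A hit here is a reason to look closer, and a clean result proves nothing. Reliable removal needs a walk over the parsed object graph.</p>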
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-this-matters">Why this matters<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#why-this-matters" class="hash-link" aria-label="Direct link to Why this matters" title="Direct link to Why this matters" translate="no">​</a></h2>
<p>If you're a developer building a SaaS that accepts document uploads, this bites you in at least four ways:</p>
<p><strong>SOC 2 audits.</strong> "How do you verify document integrity?" is a real question auditors ask. If your answer depends on file hashes, you need those hashes to be stable. A file that hashes differently on re-download is a finding.</p>
<p><strong>Deduplication.</strong> Content-addressed storage is impossible when the same logical document produces different hashes. You end up storing 12 copies of the same contract because each was exported from a slightly different tool.</p>
<p><strong>Audit trails.</strong> "Prove this is the same document that was submitted on March 3rd" fails if a re-download produces different bytes. Your audit chain breaks at the first hash comparison.</p>
<p><strong>Legal discovery.</strong> Those invisible incremental updates can contain previous drafts, deleted content, or metadata that the submitter didn't intend to share. Accepting and storing the raw file means you're potentially holding onto data you shouldn't have.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-we-built">What we built<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#what-we-built" class="hash-link" aria-label="Direct link to What we built" title="Direct link to What we built" translate="no">​</a></h2>
<p>I built an <strong>11-stage normalization pipeline</strong> that takes an arbitrary PDF and produces a deterministic canonical form.</p>
<p><strong>Same input → same bytes → same SHA-256. Guaranteed.</strong></p>
<p>Here's what each stage does:</p>
<table><thead><tr><th>Stage</th><th>Name</th><th>Purpose</th></tr></thead><tbody><tr><td>0</td><td>PDF/A Detection</td><td>Identify archival documents so we don't break ISO 19005 compliance downstream</td></tr><tr><td>1</td><td>Tamper Analysis</td><td>Count <code>%%EOF</code> markers, detect post-EOF data appended after the logical end, find shadow content, flag incremental update injection. Outputs a risk score and anomaly report</td></tr><tr><td>2</td><td>Structural Repair</td><td>qpdf collapses all incremental updates into a single revision, rewrites the cross-reference table, normalizes object streams, decrypts encrypted PDFs</td></tr><tr><td>3</td><td>Digital Signature Verification</td><td>Full PKCS#7/CMS cryptographic verification via BouncyCastle - not just detection, actual chain-of-trust validation. Configurable policy: reject, strip, or preserve</td></tr><tr><td>4</td><td>Active Content Removal</td><td>Custom object tree walker strips <code>/JavaScript</code>, <code>/AA</code>, <code>/OpenAction</code>, <code>/Launch</code>, <code>/EmbeddedFiles</code>, rich media. Recursively traverses the entire object graph, not just top-level flags</td></tr><tr><td>5</td><td>AcroForm Flattening</td><td>Render interactive form fields into page content as static graphics, then remove <code>/AcroForm</code> from the catalog</td></tr><tr><td>6</td><td>Metadata Canonicalization</td><td>Set dates to epoch zero, set producer to <code>PDFCanon</code>, strip XMP entirely, overwrite <code>/ID</code> with a deterministic value derived from content</td></tr><tr><td>7</td><td>Font Validation</td><td>Re-embed non-embedded Standard 14 fonts using metric-compatible substitutes (Liberation Sans, URW Base35). 
A PDF <em>referencing</em> Helvetica without <em>embedding</em> it is technically valid per the spec, but renders differently on every system</td></tr><tr><td>8</td><td>Final Canonical Rewrite</td><td>Second qpdf pass: <code>--normalize-content=y --recompress-flate --deterministic-id</code>. Forces stable object ordering, stable xref, and re-compresses all Flate streams to eliminate compression-level variance between tools</td></tr><tr><td>9</td><td>Content Hash</td><td>Extract text via <code>pdftotext</code>, normalize whitespace, compute SHA-256. This is a <em>logical</em> content hash, stable across visually equivalent documents even if they were produced by completely different tools</td></tr><tr><td>10</td><td>veraPDF Validation</td><td>For documents that declared PDF/A, validate the final output against ISO 19005 using veraPDF. Failures become warnings, not errors. We don't silently break archival compliance</td></tr></tbody></table>
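<p>The Stage 9 idea can be sketched in a few lines. This assumes the text has already been extracted (for example by <code>pdftotext</code>) and that "normalize whitespace" means collapsing every whitespace run to a single space. That rule and the function name are my guesses for illustration, not the pipeline's actual implementation.</p>

```python
import hashlib
import re


def logical_content_hash(extracted_text: str) -> str:
    """Collapse whitespace runs, trim the ends, and hash the UTF-8 bytes."""
    collapsed = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


# The same invoice text as two different tools might lay it out.
from_tool_a = "Invoice 1042\nTotal:   199.00 USD\n"
from_tool_b = "Invoice 1042  Total: 199.00 USD"
different_doc = "Invoice 1043\nTotal: 199.00 USD\n"

hash_a = logical_content_hash(from_tool_a)
hash_b = logical_content_hash(from_tool_b)
hash_other = logical_content_hash(different_doc)
print(hash_a == hash_b, hash_a == hash_other)
```

<p>The trade-off is that a text-only hash ignores images, fonts, and layout, so it complements the byte-level hash of the normalized file and does not replace it.</p>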
<p><strong>qpdf is pinned to an exact version</strong> in a sandboxed container, referenced by image digest - not by tag. Toolchain drift causes "deterministic" systems to silently become non-deterministic. If zlib changes its default compression level between versions (and it has), your "canonical" output drifts with it.</p>
<p><strong>The pipeline runs two qpdf passes.</strong> The first pass (Stage 2) repairs structure and collapses incremental updates. Intermediate stages then modify the document (stripping active content, canonicalizing metadata, re-embedding fonts). The second pass (Stage 8) re-normalizes everything that we touched because the PDF library we use for Flate compression isn't bitwise reproducible across runs. The second pass through qpdf's <code>--recompress-flate</code> forces it to be.</p>
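<p>For reference, the Stage 8 invocation could be assembled like this. The sketch only builds the argument list from the flags named in the table. The file names and the function name are placeholders, and any other options the real pipeline passes are not shown here.</p>

```python
def final_rewrite_command(source: str, destination: str) -> list[str]:
    """Build the qpdf argument list for the final canonical rewrite."""
    return [
        "qpdf",
        "--normalize-content=y",  # normalize content stream formatting
        "--recompress-flate",     # re-compress Flate streams with qpdf's own settings
        "--deterministic-id",     # derive /ID from the output instead of time and randomness
        source,
        destination,
    ]


command = final_rewrite_command("stage7.pdf", "canonical.pdf")
print(" ".join(command))
```

<p>Passing the list to <code>subprocess.run(command, check=True)</code> would execute it, provided a pinned qpdf binary is on the path.</p>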
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-api-response">The API response<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#the-api-response" class="hash-link" aria-label="Direct link to The API response" title="Direct link to The API response" translate="no">​</a></h2>
<p>You send a PDF, you get back the normalized file plus a compliance report:</p>
<div class="language-json codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-json codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"original"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"sha256"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"a742f1...e91d"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"sizeBytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1482301</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" 
style="color:#36acaa">"normalized"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"sha256"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"c891a0...7f02"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"sizeBytes"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">198412</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"contentHash"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"e45d21...b318"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"security"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"javascriptRemoved"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"openActionsRemoved"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"embeddedFilesRemoved"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"incrementalUpdatesRemoved"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"acroformFlattened"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span 
class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"digitalSignaturesDetected"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">false</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"tamperAnalysis"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"riskLevel"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"medium"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"anomaliesDetected"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"anomalies"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"INCREMENTAL_UPDATE_INJECTION"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"severity"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"medium"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"Document contains 12 incremental updates (%%EOF markers)."</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" 
style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"type"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"POST_EOF_DATA"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"severity"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"low"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token property" style="color:#36acaa">"description"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1,847 bytes of non-whitespace data after final %%EOF marker."</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span 
class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"validation"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"pdfaDeclared"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"pdfaLevel"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"2B"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"pdfaPreserved"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token property" style="color:#36acaa">"verapdfValidated"</span><span class="token operator" style="color:#393A34">:</span><span class="token 
plain"> </span><span class="token boolean" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token property" style="color:#36acaa">"healthScore"</span><span class="token operator" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">72</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>Every PDF that enters your system gets a stable hash, a security report, and a tamper risk score.</p>
<p>Notice the <code>contentHash</code> field - that's the logical content hash from Stage 9. It matches whenever the text content of two PDFs is equivalent, even if they were produced by completely different tools (Word vs. LibreOffice). The <code>sha256</code> on <code>normalized</code> is a different thing: it is the byte-level hash of the canonical output, and it is deterministic for the same input.</p>
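<p>To make the difference between the two hashes concrete, here is a minimal sketch of a two-level deduplication index. The class and method names are hypothetical, and the lookup paths in the usage comment (<code>report["normalized"]["sha256"]</code>, <code>report["contentHash"]</code>) are assumptions about the response layout, so adjust them to the actual response.</p>

```python
# Sketch: classify an upload as new, byte-identical after normalization,
# or logically equivalent, using the two hashes from a normalize response.
# DedupIndex is an illustrative helper, not part of any PDFCanon SDK.

class DedupIndex:
    def __init__(self):
        self.by_bytes = {}    # normalized sha256 mapped to the first doc id seen
        self.by_content = {}  # contentHash mapped to the first doc id seen

    def classify(self, doc_id, byte_hash, content_hash):
        """Return (kind, doc_id_of_match) for one normalized upload."""
        if byte_hash in self.by_bytes:
            # Canonical bytes already seen: a true duplicate.
            return ("exact", self.by_bytes[byte_hash])
        if content_hash in self.by_content:
            # Different canonical bytes, equivalent text content.
            return ("same-content", self.by_content[content_hash])
        # First time either hash is seen: remember both.
        self.by_bytes[byte_hash] = doc_id
        self.by_content[content_hash] = doc_id
        return ("new", doc_id)

# Usage, assuming `report` is the parsed JSON response:
# index = DedupIndex()
# kind, match = index.classify("doc-1",
#                              report["normalized"]["sha256"],
#                              report["contentHash"])
```

<p>Checking the byte-level hash first keeps the stricter match as the default; whether a "same-content" hit should be stored or merged is a policy decision this sketch leaves to the caller.</p>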
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-its-not">What it's not<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#what-its-not" class="hash-link" aria-label="Direct link to What it's not" title="Direct link to What it's not" translate="no">​</a></h2>
<p>This isn't a virus scanner, viewer, editor, or converter. It's infrastructure - one API call that sits between "user uploads PDF" and "you store it."</p>
<p>Think of it as <code>go fmt</code> for PDFs, except the input is untrusted and adversarial. The goal is to give you a stable hash and a clean, normalized file that you can safely store, display, or feed into downstream systems without worrying about non-determinism or hidden nasties.</p>
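<p>As a rough illustration of that "sits between upload and storage" role, the sketch below turns one report into a store-or-quarantine decision. It reads the <code>tamperAnalysis.riskLevel</code> and <code>healthScore</code> fields from the sample response earlier in this post. The policy itself is an assumption: the <code>"high"</code> risk level, the 0-100 score range, and the threshold of 50 are placeholders, not values documented by the service.</p>

```python
# Sketch of an upload gate driven by fields from the normalize response.
# The policy constants are hypothetical; pick values that fit your system.

QUARANTINE_LEVELS = {"high"}  # assumed risk level name, not confirmed by the post
MIN_HEALTH_SCORE = 50         # assumed floor on an assumed 0-100 scale

def triage(report):
    """Return "store" or "quarantine" for one parsed normalize response."""
    risk = report["tamperAnalysis"]["riskLevel"]
    score = report["healthScore"]
    if risk in QUARANTINE_LEVELS or MIN_HEALTH_SCORE > score:
        return "quarantine"
    return "store"
```

<p>With the sample report (risk level <code>medium</code>, health score 72) this placeholder policy returns <code>"store"</code>.</p>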
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="try-it">Try it<a href="https://docs.pdfcanon.com/blog/why-identical-pdfs-hash-differently#try-it" class="hash-link" aria-label="Direct link to Try it" title="Direct link to Try it" translate="no">​</a></h2>
<p>We're live at <strong><a href="https://pdfcanon.com/" target="_blank" rel="noopener noreferrer" class="">pdfcanon.com</a></strong> with a free tier - 100 PDFs/month, no credit card required.</p>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token function" style="color:#d73a49">curl</span><span class="token plain"> </span><span class="token parameter variable" style="color:#36acaa">-X</span><span class="token plain"> POST https://api.pdfcanon.com/api/normalize </span><span class="token punctuation" style="color:#393A34">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token parameter variable" style="color:#36acaa">-H</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"X-Api-Key: pdfn_your_key_here"</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">\</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token parameter variable" style="color:#36acaa">-F</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"file=@contract.pdf"</span><br></span></code></pre></div></div>
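<p>If you would rather call the endpoint from code than from a shell, the same request can be sketched with only the Python standard library. The URL, the <code>X-Api-Key</code> header, and the <code>file</code> form field come from the curl command above; the function name is hypothetical, and treating the response body as JSON is an assumption based on the sample output in this post.</p>

```python
import urllib.request
import uuid

API_URL = "https://api.pdfcanon.com/api/normalize"  # taken from the curl example

def build_normalize_request(pdf_bytes, filename, api_key):
    """Build (but do not send) a multipart/form-data POST with one `file` part."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    req = urllib.request.Request(API_URL, data=head + pdf_bytes + tail, method="POST")
    req.add_header("X-Api-Key", api_key)
    req.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    return req

# Sending is left commented out because it needs a real key and network access:
# import json
# with open("contract.pdf", "rb") as f:
#     req = build_normalize_request(f.read(), "contract.pdf", "pdfn_your_key_here")
# with urllib.request.urlopen(req) as resp:
#     report = json.load(resp)  # assumes a JSON response body
```

<p>Building the request separately from sending it makes the upload path easy to unit-test without touching the network.</p>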
<p><a href="https://docs.pdfcanon.com/" target="_blank" rel="noopener noreferrer" class="">Documentation</a> · <a href="https://docs.pdfcanon.com/docs/api-reference/normalize" target="_blank" rel="noopener noreferrer" class="">API Reference</a> · <a href="https://docs.pdfcanon.com/sdks/dotnet" target="_blank" rel="noopener noreferrer" class="">SDKs</a> · <a href="https://docs.pdfcanon.com/mcp-server" target="_blank" rel="noopener noreferrer" class="">MCP Server</a> · <a href="https://docs.pdfcanon.com/docs/known-deviations" target="_blank" rel="noopener noreferrer" class="">Known Deviations</a></p>
<p>Happy to answer questions about the PDF spec, the pipeline, or the edge cases that made me mass-delete objects at 2 AM.</p>]]></content>
        <author>
            <name>Napzoom</name>
            <uri>https://pdfcanon.com</uri>
        </author>
        <category label="pdf" term="pdf"/>
        <category label="sha256" term="sha256"/>
        <category label="normalization" term="normalization"/>
        <category label="security" term="security"/>
        <category label="compliance" term="compliance"/>
    </entry>
</feed>