Bulk Processing

This guide covers best practices for processing large volumes of PDFs with PDFCanon.

Concurrency model

PDFCanon processes jobs in parallel on its worker fleet. From the client side, you can submit multiple requests concurrently. Recommended concurrency limits by tier:

Tier	Recommended max concurrent requests
Starter	5
Growth	20
Pro	50

Exceeding these limits may result in 429 Too Many Requests responses. Implement backoff and retry logic.
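
One way to stay under these limits on the client is a semaphore sized to your tier. The sketch below is illustrative only: run_bounded, TIER_LIMITS, and the worker callable are hypothetical names, not part of PDFCanon. Here worker stands in for whatever coroutine sends a single request.

```python
import asyncio

# Recommended limits copied from the table above.
TIER_LIMITS = {"starter": 5, "growth": 20, "pro": 50}

async def run_bounded(items, worker, tier="starter"):
    """Run worker(item) for every item without exceeding the tier's limit."""
    semaphore = asyncio.Semaphore(TIER_LIMITS[tier])

    async def guarded(item):
        # At most TIER_LIMITS[tier] coroutines are inside this block at once.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Sizing the semaphore from the tier table keeps the limit in one place, so moving to a different tier only requires changing the tier argument.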

Async submission pattern

For bulk workloads, use the async mode to avoid holding open HTTP connections:

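// Top-level await requires an ES module; otherwise wrap this in an async function.
const { default: PQueue } = await import('p-queue');

// Keep concurrency at or below the recommended limit for your tier.
const queue = new PQueue({ concurrency: 10 });

// submitAsync(file) is assumed to resolve to a submission ID.
const submissionIds = await Promise.all(
  pdfFiles.map(file =>
    queue.add(() => submitAsync(file))
  )
);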

// Poll or use webhooks for completion
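
If you poll rather than use webhooks, a bounded polling loop might look like the sketch below. It makes two assumptions: fetch_status is a hypothetical function you supply that returns the current status of a submission, and the "pending" status string is a placeholder rather than a documented PDFCanon value.

```python
import time

def wait_for_completion(submission_id, fetch_status, interval=2.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(submission_id) until it stops returning "pending".

    fetch_status is caller-supplied (hypothetical); "pending" is an assumed
    status value. Raises TimeoutError once the deadline has passed.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status(submission_id)
        if status != "pending":
            return status
        if clock() >= deadline:
            raise TimeoutError(f"{submission_id} still pending after {timeout} s")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real waiting.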

Using idempotency keys

Always use idempotency keys for bulk processing to safely retry on transient failures:

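// Promise-based delay helper used between attempts.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function normalizeWithRetry(file, maxRetries = 3) {
  // The key stays the same across attempts, which is what makes retrying safe.
  const key = `bulk-migration-${file.id}-v1`;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await normalize(file, { 'Idempotency-Key': key });
    } catch (err) {
      // Give up on the last attempt, or immediately on a 422 response.
      if (attempt === maxRetries - 1 || err.status === 422) throw err;
      // Exponential backoff: 500 ms, then 1000 ms, and so on.
      await sleep(2 ** attempt * 500);
    }
  }
}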

Migration scripts

When migrating a large corpus of existing PDFs, structure your migration as:

1. Inventory: List all PDFs with their sizes and metadata.
2. Batch: Group the documents into batches of 100–500.
3. Submit: Submit each batch with idempotency keys.
4. Verify: Compare output hashes to detect duplicates.
5. Reconcile: Handle failures and resubmit with a new idempotency key version (for example, -v2 in place of -v1).
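
The batching, key versioning, and duplicate detection steps above can be sketched as small helpers. All three function names are hypothetical, and the key format simply mirrors the bulk-migration-<id>-v1 pattern from the retry example earlier; how you obtain output hashes depends on your own pipeline.

```python
from itertools import islice

def make_batches(documents, batch_size=100):
    """Yield consecutive lists of at most batch_size documents."""
    iterator = iter(documents)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def idempotency_key(doc_id, version=1):
    """Build a key per document; bump version when resubmitting after a failure."""
    return f"bulk-migration-{doc_id}-v{version}"

def find_duplicates(output_hashes):
    """Map each output hash shared by several documents to their IDs.

    output_hashes is a dict of document ID to output hash.
    """
    by_hash = {}
    for doc_id, digest in output_hashes.items():
        by_hash.setdefault(digest, []).append(doc_id)
    return {digest: ids for digest, ids in by_hash.items() if len(ids) > 1}
```

Deriving the key from the document ID and an explicit version means a plain retry reuses the same key, while a deliberate resubmission during reconciliation gets a fresh one.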

Rate limiting and backoff

Implement exponential backoff with jitter for 429 and 5xx responses:

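import random

def backoff_delay(attempt):
    """Return the delay in seconds before retry number `attempt` (0-based)."""
    base = 0.5 * (2 ** attempt)             # 0.5 s, 1 s, 2 s, ...
    jitter = random.uniform(0, base * 0.1)  # up to 10% extra to spread out retries
    return min(base + jitter, 30)           # never wait longer than 30 seconds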
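
A retry loop that applies such a delay only to 429 and 5xx responses might look like the following sketch. The names call_with_backoff and is_retryable are hypothetical, and send is assumed to be a function you supply that performs one request and returns its HTTP status code.

```python
import time

def is_retryable(status):
    """429 and any 5xx status are treated as transient."""
    return status == 429 or 500 <= status <= 599

def call_with_backoff(send, delay_for, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    delay_for(attempt) supplies the wait in seconds before the next attempt.
    max_attempts must be at least 1. Returns the last status code seen.
    """
    for attempt in range(max_attempts):
        status = send()
        if not is_retryable(status):
            return status
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < max_attempts - 1:
            sleep(delay_for(attempt))
    return status
```

Passing the backoff_delay function defined above as delay_for gives exponential waits with jitter, capped at 30 seconds.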

Next steps

Batch API — Group submissions into trackable batches
Idempotency — Idempotency key details
Error Handling — Failure taxonomy and retry strategy
