Processing Large PDFs in Delphi: HotPDF Direct File API

The job was unremarkable: a nightly Delphi service that takes scanned mortgage archives, counts pages, and routes each file to the right processing queue. It ran silently for months, until a 1.4 GB archive arrived. LoadFromFile parses the cross-reference data and materializes an object for every one of the file's several hundred thousand indirect objects, and in a 32-bit service that tree pushed the process past the 2 GB user-mode address-space ceiling mid-parse. The fix was not a bigger server. The job only ever asked one question (how many pages?), and answering it never required loading the document at all.

HotPDF's Direct File API exists for exactly this class of work: file-level PDF operations from Delphi and C++Builder that read what they need from disk instead of materializing the whole document model. Knowing which tier of the API a given operation belongs to is the difference between a service with flat memory use and one that falls over on the first oversized input.

What full loading buys — and what it costs

Loading a document with LoadFromFile buys random access to everything: any page, any object, ready for restructuring, content edits, or re-serialisation through SaveLoadedDocument. That power is the right tool for page manipulation — InsertPagesFromDocument and MovePage need the tree. The cost is proportional to the document, not to your operation: parse time scales with object count, and resident memory runs at a multiple of file size once object structures and decoded streams are accounted for.

The mismatch appears when input sizes are unbounded. Customer uploads, scanner output, and decade-old archives do not respect the assumptions of a test corpus. A pipeline that loads every input has memory requirements set by the largest file anyone will ever submit; a pipeline that uses handle-based reads for the questions that allow it has memory requirements that are roughly constant. For a long-running service, that distinction matters more than raw speed.

Moving the service to 64 bits raises the ceiling but does not change the economics: a worker that parses a gigabyte-scale file still spends seconds of CPU and a multiple of the file in RAM to answer questions the file's structure could answer directly. Concurrency makes it worse — four simultaneous large loads contend for the same memory budget, so throughput collapses exactly when the queue is busiest.

Handle-based inspection

The read-only tier opens the file as a handle, answers structural questions, and closes it — no object tree, no page rendering, no proportional memory.

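// Example call: RouteArchive('archive-2026-06.pdf')
procedure RouteArchive(const FileName: string);
var
  Pdf: THotPDF;
  Handle, PageCount: Integer;
begin
  Pdf := THotPDF.Create(nil);
  try
    // Open as a read-only handle; no document model is built
    Handle := Pdf.DAOpenFileReadOnly(FileName, '');
    if Handle > 0 then  // a non-positive handle means the open failed
    try
      PageCount := Pdf.DAGetPageCount(Handle);
      RouteByPageCount(FileName, PageCount);
    finally
      Pdf.DACloseFile(Handle);  // release the handle even if routing raises
    end;
  finally
    Pdf.Free;
  end;
end;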

Three rules keep this tier reliable. Check the handle — a non-positive return means the open failed, and calling DAGetPageCount against a dead handle is the kind of error that only appears on the malformed file a customer sends. Pair every successful open with DACloseFile in a finally block, because a service that leaks handles degrades slowly instead of failing visibly. And know what the password parameter costs: DAOpenFileReadOnly accepts one, but for encrypted inputs it falls back to a full parse internally to answer the page-count question — the flat-memory property holds only for unencrypted files, so protected inputs should go through DecryptFile first.

The handle tier also makes a cheap triage gate. Files arrive from customers mislabeled, truncated by failed uploads, or renamed from other formats; a DAOpenFileReadOnly probe rejects those in milliseconds at the front door, with a clear error attached to the right file, instead of letting them fail deep within a queue worker where the diagnosis costs an afternoon.
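
The triage gate described above can be sketched as a small probe function. This is an illustration under stated assumptions, not HotPDF's own API: ProbePdf and TProbeResult are hypothetical names, the DA* calls are used exactly as in the inspection example, and treating a page count of zero or less as a rejection is an assumption about what a caller would want. It also assumes the HotPDF unit that declares THotPDF is already in the uses clause.

```pascal
type
  // Hypothetical result type for the front-door check
  TProbeResult = (probeOk, probeOpenFailed, probeNoPages);

// Hypothetical helper: opens FileName as a read-only handle, reads the page
// count, and always closes the handle again.
function ProbePdf(Pdf: THotPDF; const FileName: string;
  out PageCount: Integer): TProbeResult;
var
  Handle: Integer;
begin
  PageCount := 0;
  Handle := Pdf.DAOpenFileReadOnly(FileName, '');
  if Handle <= 0 then
  begin
    // Not a PDF, truncated, or otherwise unreadable
    Result := probeOpenFailed;
    Exit;
  end;
  try
    PageCount := Pdf.DAGetPageCount(Handle);
  finally
    Pdf.DACloseFile(Handle);
  end;
  // Assumption: a file that reports no pages is not worth queueing
  if PageCount > 0 then
    Result := probeOk
  else
    Result := probeNoPages;
end;
```

A caller at the intake boundary can log the result next to the file name and move anything other than probeOk to a rejects directory before a queue worker ever sees it.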

Whole-file operations: copy, decrypt, encrypt

The second tier transforms complete files without exposing their internals — the workhorses of intake pipelines.

// Structural copy: validate-and-move without parsing the object tree
Status := Pdf.DACopyFile('incoming\statement.pdf', 'verified\statement.pdf');
LogDirectFileStatus('copy', Status);

// Decrypt while copying: the Direct File route into protected inputs
Status := Pdf.DecryptFile('incoming\protected.pdf',
  'verified\plain.pdf', 'batch-password');
LogDirectFileStatus('decrypt-copy', Status);

// Encrypt while copying: protect an output without a full load
Status := Pdf.EncryptFile('verified\statement.pdf',
  'outbound\statement.pdf', 'owner-secret', '', aes256, [prPrint]);
LogDirectFileStatus('encrypt-copy', Status);

Each call has a distinct role. DACopyFile is the validated copy from a quarantine directory into managed storage — it opens and indexes the PDF structure as it goes, so a truncated or non-PDF input fails here instead of three stages later. DecryptFile produces a decrypted copy, taking a direct AES-256 rewrite path that avoids building the object tree whenever the input allows it — the large-file counterpart to the load-and-resave decryption flow described in the AES-256 encryption article. EncryptFile is the mirror image, applying password protection during a file-level copy with the same key-type and permission parameters the in-memory path uses.

Appending changes instead of rewriting

Incremental update, defined in ISO 32000-1 §7.5.6, is the third tier: the original bytes stay untouched on disk, and modified or new objects are appended after them with a cross-reference section that chains back to the original. For a 900 MB archive that needs one added page, the write cost is the delta, not the file.

// Append an audit page to a large archive without rewriting it
Pdf.BeginIncrementalUpdate('archive-2026-06.pdf');
Pdf.AddPage;
Pdf.CurrentPage.SetFont('Arial', [], 10);
Pdf.CurrentPage.TextOut(50, 760, 0, 'Processed by intake service 2026-06-11');
Pdf.SaveIncrementalUpdate('archive-2026-06-stamped.pdf');  // original bytes + delta

The discipline here: BeginIncrementalUpdate must point at the original file, because the appended cross-reference data chains back to offsets within it. The model is also append-only by design: every incremental save makes the file larger, never smaller, so a document that gets stamped nightly grows without bound until a periodic re-serialisation (loading the document and writing it back through SaveLoadedDocument) compacts it. The append-only property is what makes incremental update the only safe way to modify digitally signed documents, a constraint examined in the digital signatures and PAdES article; the underlying cross-reference machinery is covered by the object streams and incremental updates article.

One property of append-only saves is easy to miss in review: the original bytes remain in the file, readable to anyone who looks. An incremental update that 'replaces' a page does not erase the old one — it supersedes it in the current revision whilst the previous revision stays recoverable. Never use incremental updates to remove sensitive content; a full re-serialisation — LoadFromFile followed by SaveLoadedDocument, which carries only the current state — is the correct way to shed history a recipient should not see.
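
Because each incremental save ends with its own %%EOF marker, counting those markers gives a rough signal of whether a file still carries earlier revisions. The sketch below is a heuristic only, written with the Delphi RTL and no HotPDF calls; CountEofMarkers is a hypothetical helper name. Linearized files and marker bytes that happen to sit inside stream data can raise the count, so a result above one is a reason to look closer, not proof of hidden history.

```pascal
uses
  SysUtils, Classes;

// Hypothetical helper, not part of HotPDF: counts '%%EOF' markers in a file,
// reading in fixed-size chunks so memory use does not depend on file size.
function CountEofMarkers(const FileName: string): Integer;
const
  ChunkSize = 64 * 1024;
  MarkerBits: Int64 = $2525454F46;  // the five bytes of '%%EOF'
  WindowMask: Int64 = $FFFFFFFFFF;  // keeps only the last five bytes
var
  Stream: TFileStream;
  Buf: TBytes;
  Window: Int64;
  BytesRead, I: Integer;
begin
  Result := 0;
  Window := 0;
  SetLength(Buf, ChunkSize);
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      BytesRead := Stream.Read(Buf[0], ChunkSize);
      for I := 0 to BytesRead - 1 do
      begin
        // Slide a five-byte window so a marker split across two chunks
        // is still detected
        Window := ((Window shl 8) or Buf[I]) and WindowMask;
        if Window = MarkerBits then
          Inc(Result);
      end;
    until BytesRead = 0;
  finally
    Stream.Free;
  end;
end;
```

An outbound stage could call this before release and send any file with more than one marker through the full re-serialisation path first.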

Matching the tier to the operation

The selection logic compresses into four lines, and it is worth encoding as an explicit routing decision at the top of a pipeline instead of leaving each job to pick its own path:

Count, inspect, classify — open a handle: DAOpenFileReadOnly, DAGetPageCount, DACloseFile.
Move, decrypt, or encrypt whole files — file-level calls: DACopyFile, DecryptFile, EncryptFile.
Restructure pages or merge documents — full load: LoadFromFile, then InsertPagesFromDocument or MovePage, then SaveLoadedDocument.
Add a small delta to a huge or signed file — BeginIncrementalUpdate and save.

Mixed pipelines benefit from a size threshold in front of the full-load path: route anything above a few hundred megabytes through the Direct File tiers and queue genuine restructuring work for a 64-bit worker with a memory budget. The threshold turns out-of-memory crashes into an explicit, observable routing decision.
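
One way to make that routing decision explicit is a single function that takes the kind of job and the file size and returns a route. Everything in this sketch is hypothetical (the type names, the enum values, and the 300 MB limit); it encodes only the four rules and the size threshold described above and contains no HotPDF calls.

```pascal
type
  // Hypothetical job kinds, one per rule in the list above
  TPdfJob = (pjInspect, pjWholeFile, pjRestructure, pjSmallDelta);
  // Hypothetical routes, including a deferral queue for oversized loads
  TPdfRoute = (rtHandle, rtFileLevel, rtFullLoad, rtIncremental,
    rtDeferTo64Bit);

const
  // Assumed threshold; tune it against the worker's real memory budget
  FullLoadLimitBytes = 300 * 1024 * 1024;

function ChooseRoute(Job: TPdfJob; FileSizeBytes: Int64): TPdfRoute;
begin
  case Job of
    pjInspect:
      Result := rtHandle;       // DAOpenFileReadOnly, DAGetPageCount
    pjWholeFile:
      Result := rtFileLevel;    // DACopyFile, DecryptFile, EncryptFile
    pjSmallDelta:
      Result := rtIncremental;  // BeginIncrementalUpdate and save
  else
    // pjRestructure: only small files are loaded in this process
    if FileSizeBytes > FullLoadLimitBytes then
      Result := rtDeferTo64Bit
    else
      Result := rtFullLoad;
  end;
end;
```

Logging the returned route for every file gives operations a record of how often the deferral path fires, which is the observable signal the threshold is meant to produce.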

Whichever tier handles a job, route its output through a temporary name and rename into place only after the result validates — a half-written file under the final name is indistinguishable from a good one to the next stage of the pipeline. The Direct File calls make that validation cheap, because confirming the output is itself a one-line handle probe.
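
The temporary-name pattern can be sketched with the Delphi RTL alone. PromoteIfValid and TValidateFunc are hypothetical names, and HasPdfHeader is only a stand-in validator so the sketch runs without HotPDF; in a real pipeline the validator would wrap the handle probe instead of checking header bytes.

```pascal
uses
  SysUtils, Classes;

type
  // Hypothetical validator signature
  TValidateFunc = function(const FileName: string): Boolean;

// Stand-in validator: checks only that the file starts with '%PDF-'
function HasPdfHeader(const FileName: string): Boolean;
const
  Magic: array[0..4] of AnsiChar = '%PDF-';
var
  Stream: TFileStream;
  Head: array[0..4] of AnsiChar;
begin
  Result := False;
  if not FileExists(FileName) then
    Exit;
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    if Stream.Read(Head, SizeOf(Head)) = SizeOf(Head) then
      Result := CompareMem(@Head, @Magic, SizeOf(Magic));
  finally
    Stream.Free;
  end;
end;

// Renames TempName to FinalName only when validation passes. A rejected
// temp file is deleted; a failed rename leaves the temp file in place.
function PromoteIfValid(const TempName, FinalName: string;
  Validate: TValidateFunc): Boolean;
begin
  Result := False;
  if not Validate(TempName) then
  begin
    if FileExists(TempName) then
      SysUtils.DeleteFile(TempName);
    Exit;
  end;
  // On Windows, RenameFile typically fails if FinalName already exists
  Result := RenameFile(TempName, FinalName);
end;
```

Keeping the temporary file in the same directory as the final name keeps the rename on one volume, so the next stage sees either no file or a complete one.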

FAQ: large PDFs in Delphi services

How do I get a PDF page count without loading the whole file?

DAOpenFileReadOnly plus DAGetPageCount, as in the inspection example above. For unencrypted files, memory use stays flat regardless of file size.

Why does my PDF grow after every save?

Incremental updates append by design; nothing is ever removed. Compact periodically with a full load-and-resave — LoadFromFile then SaveLoadedDocument — when the accumulated revisions are not needed.

Does the Direct File API open encrypted PDFs?

It accepts a password, but encrypted inputs route through a full parse internally, losing the flat-memory advantage. For protected inputs, DecryptFile with the password produces a plain copy that the remainder of the pipeline can process at file level.

The Direct File API ships as part of the HotPDF Component for Delphi and C++Builder; the product page links the full function reference, including the incremental-update calls shown here.