Processing Large PDFs in Delphi: HotPDF Direct File API

Counting the pages in a 1.4 GB scanned archive should be cheap. Call LoadFromFile on that file and it stops being cheap: HotPDF parses the cross-reference data and builds an in-memory object for every one of the document's several hundred thousand indirect objects, and a 32-bit worker hits the 2 GB address-space ceiling somewhere in the middle of that parse. The operation you wanted, the page count, never needed any of those objects. It needed the page tree and nothing else. That gap, between what a job asks for and what a full load delivers, is the whole reason the Direct File API exists.

The Direct File API gives Delphi and C++Builder file-level access to a PDF: page counts, copies, decryption, incremental appends, all reading from disk what they actually need rather than reconstructing the entire document model in RAM. The skill is matching each job to the lightest tier that can answer it. Get that match right and a service holds flat memory across any input size. Get it wrong and the first oversized file takes the worker down.

What a full load costs you

LoadFromFile is not the enemy. It earns its memory: once the tree is in RAM you have random access to every page and every object, which is exactly what InsertPagesFromDocument, MovePage, and re-serialization through SaveLoadedDocument require. There is no shortcut for genuine restructuring; you have to hold the document to rearrange it.

The trouble starts when input sizes are not yours to control. Customer uploads, scanner output, and archives from a decade ago ignore whatever your test corpus assumed. Load every input unconditionally and your memory ceiling is set by the single largest file anyone will ever submit. Parse time tracks object count, and resident memory settles at several times the file size after object structures and decoded streams are counted, so a gigabyte on disk can mean multiple gigabytes resident.

Recompiling for 64-bit lifts the address-space ceiling but leaves the bill intact. The worker still burns seconds of CPU and a multiple of the file in RAM to answer a question the file's own structure could have answered in milliseconds. Under concurrency the math turns hostile: four large loads running at once share one memory budget, and throughput craters precisely when the queue is deepest and you can least afford it.

Reading a file through a handle

The read-only tier opens a file as a handle, answers structural questions about it, and closes it. No object tree, no page rendering, no memory that grows with the input.

var
  Pdf: THotPDF;
  Handle, PageCount: Integer;
begin
  Pdf := THotPDF.Create(nil);
  try
    Handle := Pdf.DAOpenFileReadOnly('archive-2026-06.pdf', '');
    if Handle > 0 then
    try
      PageCount := Pdf.DAGetPageCount(Handle);
      RouteByPageCount('archive-2026-06.pdf', PageCount);
    finally
      Pdf.DACloseFile(Handle);
    end;
  finally
    Pdf.Free;
  end;
end;

Three habits keep this tier honest. First, check the return value. A non-positive handle means the open failed, and firing DAGetPageCount at a dead handle is the kind of bug that stays hidden until the day a customer sends a malformed file. Second, pair every successful open with DACloseFile inside a finally block; a service that leaks handles does not crash, it just rots, which is worse. Third, respect what the password parameter actually does. DAOpenFileReadOnly accepts one, but for encrypted inputs it quietly drops to a full parse to read the page count, so the flat-memory guarantee evaporates. Route protected files through DecryptFile first and the rest of the pipeline stays cheap.

The same probe doubles as a triage gate. Files show up mislabeled, half-uploaded, or renamed from some other format entirely, and a DAOpenFileReadOnly check rejects all of those at the front door in milliseconds, with the error pinned to the offending file. The alternative is letting a junk file ride deep into a queue worker and blow up there, where untangling which input caused it can cost an afternoon.
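The triage gate can be wrapped as a single Boolean check. The sketch below is one possible shape, not part of HotPDF: TryProbePageCount is a hypothetical helper name, the rule that a usable file reports at least one page is an assumption, and the fragment assumes the unit declaring THotPDF is already in the uses clause. It uses only the three handle calls shown above.

```pascal
// Hypothetical helper: returns True when the file opens as a PDF and
// reports a page count; PageCount is 0 whenever the result is False.
function TryProbePageCount(Pdf: THotPDF; const FileName: string;
  out PageCount: Integer): Boolean;
var
  Handle: Integer;
begin
  PageCount := 0;
  Result := False;
  // Empty password, as in the example above
  Handle := Pdf.DAOpenFileReadOnly(FileName, '');
  if Handle <= 0 then
    Exit;  // open failed: reject at the front door
  try
    PageCount := Pdf.DAGetPageCount(Handle);
    // Assumption: a usable PDF reports at least one page
    Result := PageCount > 0;
  finally
    Pdf.DACloseFile(Handle);  // always release the handle
  end;
end;
```

An intake loop would call this once per upload and log the file name on a False result, so a rejection stays tied to the input that caused it.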

Copy, decrypt, and encrypt whole files

The second tier moves and transforms complete files without ever exposing their internals. These are the calls intake pipelines lean on most.

// Structural copy: validate-and-move without parsing the object tree
Status := Pdf.DACopyFile('incoming\statement.pdf', 'verified\statement.pdf');
LogDirectFileStatus('copy', Status);

// Decrypt while copying: the Direct File route into protected inputs
Status := Pdf.DecryptFile('incoming\protected.pdf',
  'verified\plain.pdf', 'batch-password');
LogDirectFileStatus('decrypt-copy', Status);

// Encrypt while copying: protect an output without a full load
Status := Pdf.EncryptFile('verified\statement.pdf',
  'outbound\statement.pdf', 'owner-secret', '', aes256, [prPrint]);
LogDirectFileStatus('encrypt-copy', Status);

Each call earns its place. DACopyFile is the validated copy from a quarantine directory into managed storage: it opens and indexes the PDF structure as it goes, so a truncated or non-PDF input fails right here rather than three stages downstream. DecryptFile writes a decrypted copy along a direct AES-256 rewrite path that skips the object tree whenever the input permits; it is the large-file counterpart to the load-and-resave decryption flow covered in the AES-256 encryption article. EncryptFile runs the same motion in reverse, applying password protection during a file-level copy with the key-type and permission parameters the in-memory path already uses.

Appending changes instead of rewriting

Incremental update, defined in ISO 32000-1 §7.5.6, is the third tier. The original bytes stay where they are on disk, and any new or modified objects get appended after them, followed by a fresh cross-reference section that chains back into the original. For a 900 MB archive that needs a single page added, the write cost is the delta, not the whole file.

// Append an audit page to a large archive without rewriting it
Pdf.BeginIncrementalUpdate('archive-2026-06.pdf');
Pdf.AddPage;
Pdf.CurrentPage.SetFont('Arial', [], 10);
Pdf.CurrentPage.TextOut(50, 760, 0, 'Processed by intake service 2026-06-11');
Pdf.SaveIncrementalUpdate('archive-2026-06-stamped.pdf');  // original bytes + delta

Two points of discipline matter here. BeginIncrementalUpdate has to point at the original file, since the appended cross-reference data chains back to byte offsets inside it. And the model is append-only by design: every incremental save grows the file, never shrinks it. A document stamped nightly will swell without limit until a periodic re-serialization, loading it and writing it back through SaveLoadedDocument, compacts it down. That same append-only nature is what makes incremental update the only safe way to touch a digitally signed document, a constraint examined in the digital signatures and PAdES article. The underlying cross-reference machinery gets its own treatment in the object streams and incremental updates article.

There is a trap in append-only saves that slips past most reviews. The original bytes stay in the file, legible to anyone willing to look. An incremental update that "replaces" a page does not delete the old one; it supersedes it in the current revision while the previous revision sits there, fully recoverable. So incremental updates are the wrong tool for stripping sensitive content. To genuinely drop history a recipient should never see, you need a full re-serialization: LoadFromFile followed by SaveLoadedDocument, which writes out only the current state and leaves the buried revisions behind.

Matching the tier to the operation

The selection logic is short enough to hold in your head, and it pays to encode it as an explicit routing decision at the top of a pipeline instead of letting each job improvise its own path. The operation you need decides the tier:

Counting, inspecting, or classifying opens a handle: DAOpenFileReadOnly, DAGetPageCount, DACloseFile
Moving, decrypting, or encrypting a whole file stays at file level: DACopyFile, DecryptFile, or EncryptFile
Restructuring pages or merging documents needs the full load: LoadFromFile, then InsertPagesFromDocument or MovePage, then SaveLoadedDocument
Adding a small delta to a huge or signed file goes incremental: BeginIncrementalUpdate, then SaveIncrementalUpdate

Mixed pipelines do well to put a size threshold in front of the full-load path. Send anything past a few hundred megabytes through the Direct File tiers, and reserve the full load for genuine restructuring on a 64-bit worker with a real memory budget. The threshold converts an out-of-memory crash into a routing decision you can see and tune.
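One way to encode that routing decision is a small pure function that takes the operation and the input size and returns a tier. This is a minimal sketch, not HotPDF code: the enum names, the ChooseTier function, the Is64BitWorker flag, and the 300 MB cutoff are all assumptions made for illustration, and the cutoff in particular should be tuned against real workloads.

```pascal
program TierRouting;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical names for the operations and tiers discussed in this article
  TPdfOperation = (opInspect, opCopyOrCrypt, opRestructure, opAppendDelta);
  TPdfTier = (tierHandle, tierFileLevel, tierFullLoad, tierIncremental,
    tierDeferred);

const
  // Assumed value for "a few hundred megabytes"
  FullLoadLimitBytes = Int64(300) * 1024 * 1024;

// Maps an operation and an input size to the lightest tier that can serve it.
// Is64BitWorker says whether this worker has the memory budget for big loads.
function ChooseTier(Op: TPdfOperation; SizeBytes: Int64;
  Is64BitWorker: Boolean): TPdfTier;
begin
  case Op of
    opInspect:     Result := tierHandle;
    opCopyOrCrypt: Result := tierFileLevel;
    opAppendDelta: Result := tierIncremental;
  else
    // opRestructure is the only operation that pays for a full load
    if (SizeBytes <= FullLoadLimitBytes) or Is64BitWorker then
      Result := tierFullLoad
    else
      Result := tierDeferred;  // hand off to a 64-bit worker instead of crashing
  end;
end;

var
  Tier: TPdfTier;
begin
  // A 900 MB merge arriving at a 32-bit worker is deferred, not loaded
  Tier := ChooseTier(opRestructure, Int64(900) * 1024 * 1024, False);
  if Tier = tierDeferred then
    WriteLn('restructure: deferred to a 64-bit worker')
  else
    WriteLn('restructure: full load');
end.
```

Keeping the function free of file access makes the policy easy to unit-test and to log: the caller measures the file once, records the tier chosen, and only then touches HotPDF.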

Whatever tier handles a job, write its output to a temporary name and rename into place only once the result validates. A half-written file sitting under the final name looks exactly like a good one to the next stage of the pipeline, and the Direct File calls make the check cheap: confirming an output is a one-line handle probe.
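The write-then-rename pattern can be sketched with nothing beyond SysUtils and Classes. The names PublishIfValid, TValidateFunc, and HasPdfHeader are hypothetical, and the validator here is a deliberately weak stand-in that only checks the '%PDF-' header bytes; in a HotPDF pipeline its body would be the DAOpenFileReadOnly handle probe instead. Deleting a rejected file is also just one policy; moving it to a quarantine directory would work equally well.

```pascal
program PublishDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  // Hypothetical callback: True when the file at FileName is a good output
  TValidateFunc = function(const FileName: string): Boolean;

// Stand-in validator: checks only the first five bytes of the file
function HasPdfHeader(const FileName: string): Boolean;
const
  Magic: array[0..4] of AnsiChar = '%PDF-';
var
  Stream: TFileStream;
  Buf: array[0..4] of AnsiChar;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Result := (Stream.Read(Buf, SizeOf(Buf)) = SizeOf(Buf)) and
      CompareMem(@Buf, @Magic, SizeOf(Magic));
  finally
    Stream.Free;
  end;
end;

// Renames TempName to FinalName only if Validate accepts the temp file;
// a rejected temp file is deleted so no later stage can pick it up.
function PublishIfValid(const TempName, FinalName: string;
  Validate: TValidateFunc): Boolean;
begin
  Result := False;
  if not FileExists(TempName) then
    Exit;
  if not Validate(TempName) then
  begin
    DeleteFile(TempName);
    Exit;
  end;
  // RenameFile returns False on failure, which can happen on Windows
  // when FinalName already exists
  Result := RenameFile(TempName, FinalName);
end;

begin
  // Usage: PublishDemo <temp-file> <final-file>
  if ParamCount = 2 then
    if PublishIfValid(ParamStr(1), ParamStr(2), HasPdfHeader) then
      WriteLn('published')
    else
      WriteLn('rejected or rename failed');
end.
```

Because the final name only ever appears through the rename, a downstream stage that sees the file can assume it passed validation.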

The Direct File API ships as part of the HotPDF Component for Delphi and C++Builder. The product page links the full function reference, including the incremental-update calls shown here.
