Technical Article

HotPDF Page Extraction Performance in Delphi

Two minutes to copy three pages out of a 40-page PDF is not a performance tuning problem. It is a signal that the wrong API path is being used. When I first saw this timing on a HotPDF Component page-copy sample, my instinct was to look at the document structure first and the code second. That order turned out to matter.

What was actually slow

The PDF in question was a 40-page reference document with a non-trivial page tree: multiple intermediate /Pages nodes instead of a single flat array. The original sample code called LoadFromFile, built a new document with BeginDoc, looped over the selected page numbers, and on each iteration loaded the source document from disk again to pull one page. That is the full parse cost multiplied by however many pages you want. A 12 MB file was read and parsed again on every iteration of a three-page extraction because nobody checked whether the source could stay loaded across iterations.

The second contributor was invisible in the code: HotPDF's LoadFromFile resolves the entire cross-reference table and decompresses every object stream on load. That is the right behavior for a document you are about to modify, but it is more work than you need if you only want the page count and a subset of pages. For read-only access to structure, DAOpenFileReadOnly avoids deserializing the full object tree, which matters on compressed files with large image resources.

Neither of these is a library bug. In both cases the caller picked an API designed for one job and used it for a different one.

Using InsertPagesFromDocument for page extraction

The right path for copying a range of pages from one HotPDF document into another is InsertPagesFromDocument, called after LoadFromFile on the source. You load the source once, load or create the destination once, move the pages, and save. The source stays in memory across all the page insertions:

procedure ExtractPages(const SourceFile, DestFile: string;
  const PageRange: string);
var
  Source, Dest: THotPDF;
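begin
  // Nil first so the finally block stays safe if either constructor raises
  Source := nil;
  Dest   := nil;
  try
    Source := THotPDF.Create(nil);
    Dest   := THotPDF.Create(nil);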
    // Load source once: full parse happens here and only here
    Source.LoadFromFile(SourceFile);

    // Build a minimal destination document
    Dest.FileName := DestFile;
    Dest.BeginDoc;

    // Copy the requested range; '1-3' inserts pages 1 through 3
    // starting at position 1 in the destination
    Dest.InsertPagesFromDocument(Source, PageRange, 1);

    Dest.EndDoc;
  finally
    Source.Free;
    Dest.Free;
  end;
end;

The PageRange parameter accepts the same format as the command-line sample: a comma-separated list of page numbers or ranges such as '1-3' or '1,5,7-9'. Pages are 1-based. InsertPagesFromDocument copies content streams, resource dictionaries, and page geometry without touching metadata, bookmarks, or embedded file attachments unless they are referenced from the copied pages. For a three-page extraction from a 40-page document, that is a small working set.
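
Because the range string comes from user input in most tools, it can help to expand and bounds-check it before handing it to the library. The following is a minimal sketch of that idea, not HotPDF code: ExpandPageRange and TPageList are hypothetical names introduced here, and the sketch makes no assumption about how HotPDF itself reacts to a malformed range.

```pascal
program PageRangeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  TPageList = array of Integer;

// Hypothetical helper (not a HotPDF call): expands a range string such as
// '1,5,7-9' into 1-based page numbers and rejects anything outside
// 1..PageCount. Malformed numbers raise EConvertError via StrToInt.
function ExpandPageRange(const PageRange: string;
  PageCount: Integer): TPageList;
var
  Parts: TStringList;
  Part: string;
  I, P, DashPos, First, Last: Integer;
begin
  Result := nil;
  Parts := TStringList.Create;
  try
    // Split on commas only; StrictDelimiter keeps spaces inside an item
    Parts.StrictDelimiter := True;
    Parts.Delimiter := ',';
    Parts.DelimitedText := PageRange;
    for I := 0 to Parts.Count - 1 do
    begin
      Part := Trim(Parts[I]);
      DashPos := Pos('-', Part);
      if DashPos > 0 then
      begin
        // 'a-b' form: both ends are inclusive
        First := StrToInt(Trim(Copy(Part, 1, DashPos - 1)));
        Last  := StrToInt(Trim(Copy(Part, DashPos + 1, MaxInt)));
      end
      else
      begin
        // single page number
        First := StrToInt(Part);
        Last  := First;
      end;
      if (First < 1) or (Last < First) or (Last > PageCount) then
        raise Exception.CreateFmt('Page range "%s" is outside 1..%d',
          [Part, PageCount]);
      for P := First to Last do
      begin
        SetLength(Result, Length(Result) + 1);
        Result[High(Result)] := P;
      end;
    end;
  finally
    Parts.Free;
  end;
end;

var
  Pages: TPageList;
  I: Integer;
begin
  Pages := ExpandPageRange('1,5,7-9', 40);
  for I := 0 to High(Pages) do
    Write(Pages[I], ' ');   // prints 1 5 7 8 9
  Writeln;
end.
```

The page count passed as the second argument would come from the loaded source document; the value 40 here simply mirrors the example file in this article.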

Timing on the same 12 MB file that previously ran for two minutes: under 1.5 seconds with this pattern. Most of that time is the single LoadFromFile call. The document structure is irrelevant once the object table has been resolved the first time.

When LoadFromFile is too much: the Direct File API

If you only need to count pages, inspect document info, or copy a file without touching its contents, the Direct File API avoids the full parse entirely. DAOpenFileReadOnly maps the cross-reference table without decompressing object streams, so page count is O(xref size) rather than O(file size):

procedure InspectPDF(const FileName: string);
var
  Pdf: THotPDF;
  Handle, PageCount: Integer;
begin
  Pdf := THotPDF.Create(nil);
  try
    Handle := Pdf.DAOpenFileReadOnly(FileName, '');
    if Handle <= 0 then
      Exit;
    try
      PageCount := Pdf.DAGetPageCount(Handle);
      Writeln('Pages: ', PageCount);

      // DACopyFile is a byte-preserving copy, no re-serialization
      Pdf.DACopyFile(FileName, 'archive-copy.pdf');
    finally
      Pdf.DACloseFile(Handle);
    end;
  finally
    Pdf.Free;
  end;
end;

The caveat: DAOpenFileReadOnly accepts a password parameter but falls back to a full parse for encrypted inputs, because decryption requires the object tree to resolve the encryption dictionary. If your source files are encrypted, decrypt them first with DecryptFile to get an unencrypted copy, then open that with the Direct File API. The file-level DecryptFile function takes a direct AES-256 rewrite path for standard encryption and is faster than LoadFromFile followed by SaveLoadedDocument for large files, because it does not build the full in-memory object model.

Memory during large-batch processing

Batch jobs that process dozens of files in a loop have a pattern that looks correct but accumulates memory: creating THotPDF inside the loop, calling LoadFromFile, doing work, calling Free. That is structurally fine. The problem arises when the inner work allocates scratch objects, catches exceptions, and leaves those scratch objects live on error paths. Delphi's memory manager does not compact, so a hundred error-path leaks across a batch run can push memory high enough to slow allocation for everything else.

The fix is not exotic. Every THotPDF and every intermediate TStream or TBitmap that participates in PDF work belongs in a try/finally block that frees it in the finally branch. Set local references to nil before the try so the finally branch can call x.Free safely when initialization fails partway through; Free does nothing on a nil reference, so an explicit if Assigned(x) check is optional. This is standard Delphi ownership discipline and it is the full story for this class of problem.
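
As an illustration of that discipline, here is a small sketch that uses only RTL classes. ProcessOne is a hypothetical stand-in for per-file PDF work, and the stream and string list play the role of scratch objects; one call in ten fails on purpose to exercise the error path.

```pascal
program ScratchCleanupDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical stand-in for per-file work. Raises after the first scratch
// object exists when ShouldFail is True, so Notes is still nil in finally.
procedure ProcessOne(const Name: string; ShouldFail: Boolean);
var
  Buffer: TMemoryStream;
  Notes: TStringList;
begin
  // Nil first: the finally block is then safe wherever the failure occurs
  Buffer := nil;
  Notes := nil;
  try
    Buffer := TMemoryStream.Create;
    Buffer.Size := 64 * 1024;
    if ShouldFail then
      raise Exception.Create('simulated failure in ' + Name);
    Notes := TStringList.Create;
    Notes.Add(Name);
  finally
    // Free is nil-safe, so no Assigned check is needed here
    Notes.Free;
    Buffer.Free;
  end;
end;

var
  I, Failures: Integer;
begin
  Failures := 0;
  for I := 1 to 100 do
    try
      ProcessOne(Format('file%d.pdf', [I]), I mod 10 = 0);
    except
      on E: Exception do
        Inc(Failures);   // the batch continues; nothing is left allocated
    end;
  Writeln('Failures handled: ', Failures);   // prints Failures handled: 10
end.
```

The exception handler sits in the batch loop, outside the procedure that owns the scratch objects, so cleanup never depends on which error path was taken.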

One more thing to check in batch contexts: AddImage registers images in an internal list that persists for the lifetime of the THotPDF instance. If you reuse a single instance across many documents by calling LoadFromFile repeatedly, image registrations from earlier documents stay in the list. Either create a fresh instance per document or call the image-list clear path between documents.

Measuring before changing anything

Before reaching for any of these patterns, measure. Delphi's TStopwatch from System.Diagnostics wraps QueryPerformanceCounter on Windows and is accurate enough for wall-clock profiling of file I/O. Time LoadFromFile on its own and see what share of the total it accounts for. If it is 90% of the total time, the fix is the Direct File API or reducing how many times you parse the same file. If it is under 20%, the bottleneck is somewhere else and you are chasing the wrong thing.
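
A minimal sketch of that measurement follows. LoadStep and RemainingWork are placeholders that only sleep; in a real test they would be replaced by the LoadFromFile call and by the rest of the job. SharePercent is a hypothetical helper introduced for this example.

```pascal
program LoadShareDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

// Placeholder for the load step (for example Source.LoadFromFile)
procedure LoadStep;
begin
  Sleep(90);
end;

// Placeholder for everything that happens after the load
procedure RemainingWork;
begin
  Sleep(10);
end;

// Hypothetical helper: share of the total taken by the load, in percent
function SharePercent(LoadMs, TotalMs: Int64): Double;
begin
  if TotalMs <= 0 then
    Result := 0
  else
    Result := 100.0 * LoadMs / TotalMs;
end;

var
  TotalTimer, LoadTimer: TStopwatch;
begin
  TotalTimer := TStopwatch.StartNew;

  // Time the load step on its own
  LoadTimer := TStopwatch.StartNew;
  LoadStep;
  LoadTimer.Stop;

  RemainingWork;
  TotalTimer.Stop;

  Writeln(Format('Load: %d ms of %d ms (%.0f%%)',
    [LoadTimer.ElapsedMilliseconds, TotalTimer.ElapsedMilliseconds,
     SharePercent(LoadTimer.ElapsedMilliseconds,
       TotalTimer.ElapsedMilliseconds)]));
end.
```

Running the measurement a few times and discarding the first run helps separate parse cost from cold disk cache effects.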

The two-minute extraction that started this post turned out to be entirely the repeated-load pattern. The document structure contributed nothing; a flat page tree would have run the same way. Switching to a single LoadFromFile followed by one InsertPagesFromDocument call brought it to 1.3 seconds on the same hardware without touching anything else.

The page manipulation API shown here is part of the HotPDF Component for Delphi and C++Builder.