Technical Article

Fast PDF Merge in Delphi: Byte-Level Ref Shifting

Concatenating PDFs sounds like it should be cheap. The page content is already laid out, the fonts are already embedded, the images are already compressed. In principle a merge is just bookkeeping: renumber the objects so two files' numbering spaces stop colliding, stitch the page trees together, fix up the cross-reference table, and write. In practice most merge code throws that cheapness away. For every object in every input file it runs a full parse into a tokenized object tree, mutates a couple of indirect references, then serializes the tree back to bytes. The parse and the reserialize are the expensive halves, and for the vast majority of objects they produce a byte sequence almost identical to what went in.

PDFlibPas is a native Object Pascal PDF engine for Delphi and C++Builder, and its fast merge path exists to skip that round trip wherever it is provably safe. The idea is narrow but it pays off across whole document sets: for an unmodified non-stream object, take the original source bytes verbatim and do a single byte-level rewrite of the indirect references they contain, turning every N G R into (N+Offset) G R. No tokenizer, no object tree, no serializer. This article walks through where that shortcut is legal, the parser state machine that performs the byte rewrite without corrupting anything, why bookmark merging needed a different mechanism entirely, and how the ordinary merge path was rebuilt from quadratic to linear at the same time.

Why object renumbering is the real cost of a merge

Every PDF carries its own object numbering space. File A has object 1, object 2, and so on; file B has its own object 1, object 2, and so on. You cannot drop B's objects into A's file unchanged, because the numbers would collide and every indirect reference inside B would now resolve to the wrong object. The fix is an offset: if A ends at object count Offset, then B's object N becomes object N+Offset in the output, and every reference N G R that appears anywhere inside B's objects must be shifted to (N+Offset) G R to match.

That shift is the entire semantic job of merging the body. The page tree fixups and the AcroForm merge are small, bounded edits on a handful of objects. The bulk work is rewriting references across thousands of objects, and the naive way to do it is to parse each object so you can find the references structurally. PDFlibPas's MergeFileListFast takes the opposite view: the references are findable in the raw bytes too, if you are careful about the contexts where a digit-space-digit-space-R sequence is not a reference. Skip the parse, shift in place, and the per-object cost collapses to a single linear scan of bytes you were going to copy anyway.

When reusing source bytes is provably safe

The byte path is only taken when three conditions all hold for the object being copied out of a following document. Any one of them failing sends the object back through the full decode-and-reserialize route, so correctness always wins over speed:

  • Doc2.IsChangedObject(X) is False. If the merge engine already mutated the object in memory (a page object whose /Parent was repointed, for instance), the in-memory tree is the source of truth and the original bytes are stale. Only untouched objects qualify.
  • The source bytes contain no stream keyword. A stream object's body is opaque binary framed by stream/endstream, and a naive reference scan over compressed or encrypted stream data would happily "find" and corrupt byte patterns that look like references. Stream objects keep the original stream-aware path.
  • The source bytes contain neither /StructTreeRoot nor /StructElem. In the fast profile the tagged-PDF structure tree is dropped rather than merged, so those objects must go through the decode path where the engine can null them out deliberately.

The decision lives in the per-object copy loop. When all three checks pass, the object's bytes go straight to ShiftIndRefsInSource and then to the writer; otherwise the bytes are discarded and the object is rebuilt with GetObject, shifted with ShiftIndRef, and serialized. The structure of that branch is worth seeing, because the order of the checks is what keeps it safe:

ObjectData := '';
if not Doc2.IsChangedObject(X) then
begin
  ObjectData := FastMergeObjectSource(Reader2, X);
  if (PLPos('stream', ObjectData) > 0) or
     ((not PreserveStructTree) and (PLPos('/StructTreeRoot', ObjectData) > 0)) or
     ((not PreserveStructTree) and (PLPos('/StructElem', ObjectData) > 0)) then
    ObjectData := ''                                  // fall back to decode
  else
    ObjectData := ShiftIndRefsInSource(ObjectData, Offset);
end;

if ObjectData <> '' then
  Writer.AddObject(X + Offset, Doc2.GetGenNum(X), ObjectData)
else
begin
  Obj := Doc2.GetObject(X, TempStruct);              // full parse path
  // ... null out struct-tree objects, ShiftIndRef, Obj.Output ...
end;

An empty ObjectData is the signal that the byte path declined the object. That single sentinel keeps the fast and slow routes from drifting apart: there is exactly one place that decides, and exactly one fallback.

The reference-shifting state machine and its edge cases

A byte rewrite of indirect references is deceptively easy to get wrong, because R and runs of digits appear all over a PDF object in contexts that are not references. ShiftIndRefsInSource is a small hand-written scanner that walks the bytes once and only rewrites a number when it is followed, with PDF whitespace between the tokens, by another number and then an R delimiter. The cheap exits come first: if the offset is zero or the source is empty, the bytes are returned untouched without entering the scanner at all.

The scanner's correctness rests on recognizing the contexts where a reference-shaped sequence must be left alone. These are the boundaries that are easiest to miss, and each one is handled explicitly:

  • Literal strings delimited by ( and ) are copied verbatim, tracking nesting depth and honoring the backslash escape so that an escaped parenthesis does not throw the depth count off. A string like (see object 3 0 R for details) contains a textbook reference pattern that is really just prose, and it must survive byte-for-byte.
  • Hexadecimal strings delimited by < and > are passed through without interpretation. The bytes 52 inside a hex string are the ASCII code for R, and a scanner that treated hex payload as text could manufacture a phantom reference. The opening << of a dictionary is detected first so a dictionary is not mistaken for a hex string.
  • Name objects starting with / are consumed whole, from the slash up to the next whitespace or delimiter. Without this, a name like /R (a common resource key) could be read as the R of a reference.
  • Comments introduced by % run to end of line and are skipped as opaque text.
  • The number-then-R test is strict. A reference is only recognized as N whitespace G whitespace R with the R terminated by whitespace, a delimiter, or end of input. If the generation number is missing, or an R is followed by a letter, the digits are emitted unchanged. This is what protects the integer in /Length 1234 and the four numbers of a MediaBox from being silently incremented.

The heart of that strict test reads almost exactly like the specification sentence describes it:

if (P <= N) and (Source[P] = 'R') and
   ((P = N) or PLIsPdfWhite(Source[P + 1]) or PLIsPdfDelimiter(Source[P + 1])) then
  Obj1 := PLStrToIntDef(PLCopy(Source, I, E1 - I), -1);

if Obj1 >= 0 then
begin
  AppendStr(PLIntToStr(Obj1 + Offset));   // shifted object number
  AppendBytes(E1, P - E1);                 // original whitespace + generation
  AppendBytes(P, 1);                       // the 'R'
end;

Only the object number is rewritten; the generation number and the exact original whitespace between tokens are copied through, so the output is byte-identical to the input except for the one integer that had to change. That precision is the whole point — it is what makes reusing source bytes equivalent to a full reserialize, not merely close to it. The behavior is covered by a focused set of unit tests exercising bare references, references inside arrays, non-reference numbers, literal strings, hex strings, and non-zero generation numbers with an offset applied.

Why bookmarks could not reuse AppendOutline

Merging multiple documents' bookmarks into one outline tree looks like a job for the existing AppendOutline helper, which already knows how to graft one document's top-level bookmarks onto another's. It is the wrong tool here, and the reason is a subtle layering mismatch. AppendOutline locates the current last top-level bookmark by walking the reader over the original file bytes. But the fast merge stages its edits in a new-objects buffer through ChangeObject; the reader never sees those edits. Chain three or more documents and every append re-points the first document's original last bookmark at the newest document, so all the intermediate documents' bookmarks fall out of the chain — only the cumulative /Count stays right, which makes the bug easy to miss until someone opens the bookmark panel.

The fast path solves it with a two-phase, metadata-driven injection that never re-walks the reader. A first pass over all inputs collects, per document, the outline root object and generation numbers, the first and last top-level bookmark numbers, and the root's /Count. From that summary the code computes the global object numbers of every link it needs to forge — each document's top-level /Parent to the shared root, the first bookmark's /Prev to the previous document's last, the last bookmark's /Next to the next document's first — using pure object-number arithmetic. There is a write-ordering constraint behind this: the first document's objects are written out before any following document is even opened, so all of the first document's outline edits (root /Count and /Last, and the old last bookmark's /Next) have to be expressible as arithmetic that needs no later document in hand. Each following document's edits are applied in place after it is opened but before it is written, so they ride out through the same change-object path.

The offset-alignment invariant that ties it together

Both the reference shift and the bookmark injection depend on one arithmetic invariant, and it is the most fragile assumption in the whole design. A reference injected into a following document is written as target global object number minus that document's Offset, so that when the object is later shifted by ShiftIndRef(Offset) the value lands on the intended global number. The first document gets Offset = 0 and uses global numbers directly. For that subtraction to be correct, the running offset sequence used during injection has to match the offset sequence used when objects are finally written out.

It does, because of a property of how the page and form merges work: AddPages, AddFields, and AddFieldFonts only modify the first document's existing objects — they never add new ones. So the first document's object count is unchanged across the page-merge stage, and each following document's offset (the sum of all preceding documents' object counts) stays stable from injection through write-out. Break that — introduce a stage that creates a new object mid-merge — and every page and bookmark reference downstream would be off by the count of objects you added. The invariant is quiet, but it is load-bearing.

Three entry points over one engine

The fast path is not a fork of the merge code. In the same line of work, the byte-level engine was factored into a single internal routine, MergeFileListInternal(ListName, OutputFileName, PreserveStructTree, StrictMode), and the public APIs became thin wrappers that choose two flags:

  • MergeFileListFast calls the engine with structure-tree preservation off — the leanest path, dropping the tagged-PDF tree so the byte route applies to the most objects.
  • MergeFileList calls it with preservation on, so the structure tree survives and the result stays a usable tagged PDF. This ordinary path also inherits the multi-document bookmark and form merging.
  • MergeFileListStrict turns on strict mode: the first metadata pass stops at the first input that does not report a clean merge, so only the documents collected before the bad file are included, rather than skipping the bad file and continuing.

Folding the paths together also let the ordinary merge be rebuilt from a pairwise O(N²) loop — merge file one and two, merge that result with three, and so on, reparsing the growing accumulator every step — into a single linear pass that opens each input once. The two long-standing two-file and two-stream entry points, MergeFiles and MergeStreams, are untouched and remain available for callers that genuinely want a pairwise merge.

One honest note on the structure-tree behavior, because it bit the test suite. The fast path's "drop" is not total: it removes the first document's catalog reference to /StructTreeRoot, but the structure-tree object itself still gets written out as an orphan. So the fast output's bytes still contain the /StructTreeRoot string, and you cannot distinguish fast from ordinary output by searching for that string — the real difference is whether the catalog still reaches the structure tree, which is what determines whether the file is still a navigable tagged PDF.

When to reach for which path

The byte path is a throughput optimization for assembling many documents where you do not need the tagged-PDF structure tree preserved — report bundling, statement runs, batch concatenation. Measured over repeated merges of medium-to-large input sets, the byte reuse trimmed roughly four to thirteen percent off wall-clock time depending on object mix, with no new failures on small or malformed inputs, because any object the scanner cannot prove safe falls back to the full parse. If you do need the structure tree intact for accessibility, use the ordinary tagged-PDF merge path, which preserves it; and if you are working with very large single files rather than many inputs, the byte-copy techniques described in the companion piece on large PDF merge and split with direct file access apply the same "copy bytes, avoid the full object tree" philosophy at file scale.

The merge routines and their fast and strict variants are part of the PDFlibPas Delphi PDF Library, whose documentation carries the full reference for the file-list API and the merge options described here.