PDF Page Order Bugs in HotPDF: Physical vs Logical Structure

The symptom showed up in a page-copying utility built on top of HotPDF Component: asking for page 1 of a three-page document consistently produced page 2. Checking the indexing logic found nothing wrong. The call was using a 0-based logical index, the arithmetic was correct, and the boundary conditions were fine. Yet the wrong page came out every time.

The bug was not in the copying code at all. It was in how HotPDF was building its internal page array when loading the file.

Figure: PDF page order. The /Kids array in the Pages tree defines the logical sequence, independent of how objects are numbered or stored in the file.

Two orderings, one source of confusion

A PDF file is a collection of indirect objects, each identified by an object number. The file structure imposes no obligation on those numbers to reflect reading order. Object 1 can hold page 2; object 20 can hold page 1. What actually defines the reading order is the page tree: a hierarchy of /Pages dictionaries whose /Kids arrays list page references in the sequence a viewer should display them (ISO 32000-1 §7.7.3).

The document triggering the bug had this page tree structure:

% Pages tree root, object 16
16 0 obj
<<
  /Type /Pages
  /Count 3
  /Kids [20 0 R   % logical page 1
         1 0 R    % logical page 2
         4 0 R]   % logical page 3
>>
endobj

The file happened to list object 1 and object 4 before object 20 in the byte stream. Any parser that iterated through indirect objects in file order and stamped them into a PageArr as it found page-type dictionaries would end up with object 1 at index 0, object 4 at index 1, and object 20 at index 2. Logical page 1 sits at PageArr[2]. Asking for page index 0 fetches logical page 2 instead.

That is exactly what both of HotPDF's internal parsing paths were doing. The traditional path, used for PDF 1.3/1.4 files, and the modern path, used for object-stream documents (PDF 1.5+), each built PageArr by walking indirect objects in physical file order rather than following the /Kids chain.
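The mismatch can be reproduced without any PDF library. The following is a minimal sketch, not HotPDF code: it hard-codes the object numbers from the example above, and the names FileOrder, KidsOrder and LogicalPageOf are made up for illustration. It prints which logical page ends up at each index when the array is filled in file order.

```pascal
program PhysicalVsLogical;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

const
  { Page objects in the order they occur in the file (illustrative data) }
  FileOrder: array[0..2] of Integer = (1, 4, 20);
  { Page objects in /Kids order, which is the logical reading order }
  KidsOrder: array[0..2] of Integer = (20, 1, 4);

{ Returns the 1-based logical page number of an object,
  or 0 when the object is not listed in /Kids }
function LogicalPageOf(ObjNum: Integer): Integer;
var
  K: Integer;
begin
  Result := 0;
  for K := Low(KidsOrder) to High(KidsOrder) do
    if KidsOrder[K] = ObjNum then
    begin
      Result := K + 1;
      Exit;
    end;
end;

var
  I: Integer;
begin
  { A loader that scans in file order stores FileOrder[I] at PageArr[I] }
  for I := Low(FileOrder) to High(FileOrder) do
    WriteLn('PageArr[', I, '] = object ', FileOrder[I],
            ' = logical page ', LogicalPageOf(FileOrder[I]));
end.
```

Index 0 maps to logical page 2, index 1 to logical page 3, and index 2 to logical page 1, which matches the behaviour seen in the utility.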

Confirming the hypothesis

Before touching any fix, the mismatch needed to be proven rather than assumed. The qpdf command-line tool makes this straightforward:

{ shell }
qpdf --show-pages input.pdf
# Output lists the pages in /Kids order: 20 0 R, then 1 0 R, then 4 0 R

qpdf --show-object=16 input.pdf
# Shows the Pages dictionary with /Kids in reading order

Extracting each page individually and checking file sizes confirmed the mapping: PageArr[0] produced the content belonging to logical page 2, and PageArr[2] held logical page 1. The circular shift was the smoking gun. It also explained why the problem appeared across several unrelated source documents: any PDF in which a page object is encountered before the object for an earlier logical page (typically because it has a lower object number) would trigger it.

There are straightforward reasons PDFs end up in this state. Incremental saves append new and replaced objects to the end of the file and leave the superseded copies behind, so byte position stops tracking reading order. Editors that add a cover page insert it with a high object number regardless of its position in the Kids array. Some generators simply write pages in an order convenient for content streaming rather than logical page sequence. The PDF format does not require them to do otherwise.

The fix: follow the Kids array

The correct approach is to build PageArr by walking the /Kids chain from the catalog root, not by scanning indirect objects. After both parsing paths complete their initial pass, a post-processing step resolves the logical order:

procedure THotPDF.ReorderPageArrByPagesTree;
var
  PagesObj  : THPDFDictionaryObject;
  KidsArray : THPDFArrayObject;
  NewPageArr: array of THPDFDictArrItem;
  I, J, PageIndex, KidsIndex: Integer;
  RefObj    : THPDFLink;
  PageObjNum: Integer;
  Found     : Boolean;
begin
  { Locate root /Pages dictionary via FRootIndex }
  PagesObj := FindPagesRootFromCatalog;
  if PagesObj = nil then Exit;

  KidsIndex := PagesObj.FindValue('Kids');
  if KidsIndex < 0 then Exit;
  KidsArray := THPDFArrayObject(PagesObj.GetIndexedItem(KidsIndex));

  SetLength(NewPageArr, KidsArray.Items.Count);
  PageIndex := 0;

  for I := 0 to KidsArray.Items.Count - 1 do
  begin
    RefObj     := THPDFLink(KidsArray.GetIndexedItem(I));
    PageObjNum := RefObj.Value.ObjectNumber;

    Found := False;
    for J := 0 to Length(PageArr) - 1 do
    begin
      if PageArr[J].PageLink.ObjectNumber = PageObjNum then
      begin
        NewPageArr[PageIndex] := PageArr[J];
        Inc(PageIndex);
        Found := True;
        Break;
      end;
    end;
    { Non-page Kids (intermediate /Pages nodes) produce no match; skip }
  end;

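  { Apply the new order only when every existing entry was matched.
    A partial match means some pages sit under intermediate /Pages nodes,
    and truncating PageArr to PageIndex would silently drop them. }
  if (PageIndex > 0) and (PageIndex = Length(PageArr)) then
  begin
    for I := 0 to PageIndex - 1 do
      PageArr[I] := NewPageArr[I];
  end;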
end;

The call goes in at the end of each parsing path, after all objects have been catalogued but before any page operation is serviced:

{ Traditional path }
ListExtDictionary(THPDFDictionaryObject(IndirectObjects.Items[I]), FPageslink);
ReorderPageArrByPagesTree;
Break;

{ Modern path (object streams) }
if TryParseModernPDF then
begin
  Result := ModernPageCount;
  ReorderPageArrByPagesTree;
  Exit;
end;

The reorder step is O(n * m), where n is the Kids count and m is the current PageArr length. For a document with a flat page tree (all leaves at depth 1, which covers the overwhelming majority of real-world PDFs) both equal the page count, so the cost is quadratic in the number of pages, which is negligible at typical document sizes. Nested page trees require a recursive walk rather than the single-level approach shown here; the production implementation handles that case separately.
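For the nested case, the walk has to descend into every intermediate /Pages node and collect leaves depth-first. The sketch below shows the idea on a self-contained model; it is not the production implementation. TNode, AddNode, FindNode, CollectPages and MaxDepth are hypothetical names rather than HotPDF types, and the tree (root 16, intermediate node 30) is invented for the example.

```pascal
program WalkPageTree;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MaxDepth = 64;  { arbitrary guard against cyclic /Kids references }

type
  TIntArray = array of Integer;
  { Hypothetical stand-in for a parsed dictionary; not a HotPDF type }
  TNode = record
    ObjNum: Integer;
    IsPage: Boolean;   { True for /Type /Page, False for /Type /Pages }
    Kids  : TIntArray; { child object numbers; empty for a page }
  end;

var
  Nodes: array of TNode;
  Order: TIntArray;    { page object numbers in logical order }
  I: Integer;

procedure AddNode(ObjNum: Integer; IsPage: Boolean; const Kids: array of Integer);
var
  N, K: Integer;
begin
  N := Length(Nodes);
  SetLength(Nodes, N + 1);
  Nodes[N].ObjNum := ObjNum;
  Nodes[N].IsPage := IsPage;
  SetLength(Nodes[N].Kids, Length(Kids));
  for K := 0 to High(Kids) do
    Nodes[N].Kids[K] := Kids[K];
end;

{ Returns the index of the node with this object number, or -1 }
function FindNode(ObjNum: Integer): Integer;
var
  N: Integer;
begin
  Result := -1;
  for N := 0 to High(Nodes) do
    if Nodes[N].ObjNum = ObjNum then
    begin
      Result := N;
      Exit;
    end;
end;

{ Depth-first walk: leaves are appended in the order /Kids lists them }
procedure CollectPages(ObjNum, Depth: Integer);
var
  Idx, K: Integer;
begin
  if Depth > MaxDepth then Exit;
  Idx := FindNode(ObjNum);
  if Idx < 0 then Exit;              { dangling reference: skip }
  if Nodes[Idx].IsPage then
  begin
    SetLength(Order, Length(Order) + 1);
    Order[High(Order)] := ObjNum;
  end
  else
    for K := 0 to High(Nodes[Idx].Kids) do
      CollectPages(Nodes[Idx].Kids[K], Depth + 1);
end;

begin
  { Nodes are registered in file order; the tree is 16 -> [30 -> [20, 1], 4] }
  AddNode(1, True, []);
  AddNode(4, True, []);
  AddNode(16, False, [30, 4]);
  AddNode(20, True, []);
  AddNode(30, False, [20, 1]);

  CollectPages(16, 0);
  for I := 0 to High(Order) do
    WriteLn('logical page ', I + 1, ' = object ', Order[I]);
end.
```

The walk yields objects 20, 1, 4 in that order even though object 20 is registered fourth. The linear FindNode keeps the sketch short; a lookup table keyed by object number would avoid the repeated scans on large documents.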

Using CopyPageFromDocument after the fix

With ReorderPageArrByPagesTree in place, logical page indices work as expected. The higher-level CopyPageFromDocument takes a 0-based logical index and copies the correct page into the destination document:

var
  Source, Dest: THotPDF;
begin
  Source := THotPDF.Create(nil);
  Dest   := THotPDF.Create(nil);
  try
    Source.LoadFromFile('source.pdf');

    Dest.FileName := 'extracted.pdf';
    Dest.BeginDoc;

    { Copy logical page 0 (first page the user sees) }
    Dest.CopyPageFromDocument(Source, 0, 0);

    Dest.EndDoc;
  finally
    Source.Free;
    Dest.Free;
  end;
end;

CopyPageFromDocument internally queries the page tree order rather than relying on the raw PageArr index, so it behaves correctly even against documents where physical and logical order diverge. For batch operations, InsertPagesFromDocument accepts an array of logical indices and copies them in one pass.

What this reveals about PDF parsing

The PDF specification is explicit: logical page order is defined by the /Kids array of the page tree, not by object numbers or byte offsets (ISO 32000-1 §7.7.3.2). Any parser that uses a different ordering as a shortcut will produce correct results on the majority of documents it sees, because most generators write pages in the natural order and assign sequential object numbers. The bug hides until someone loads a PDF that was incrementally edited, reorganized by another tool, or generated by software that chose a different layout.

Testing only against self-generated PDFs misses this class of problem entirely. The fix for a page ordering regression therefore needs a corpus of documents from varied sources: incremental saves, scanned documents with inserted cover pages, PDFs produced by tools that linearize or optimize the object graph differently. A document that triggered the original bug should stay in the regression suite permanently.
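One way to pin such a document down in a regression suite is to record its expected page object order once (for example from the qpdf output) and compare it with what the loader produces. The following is a sketch under that assumption; FirstMismatch and Report are hypothetical helpers, and the two "loader" orders are hard-coded stand-ins rather than results read from HotPDF.

```pascal
program PageOrderRegression;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

{ Returns -1 when both orders agree, otherwise the first differing index }
function FirstMismatch(const Expected, Actual: array of Integer): Integer;
var
  I, Common: Integer;
begin
  Common := Length(Expected);
  if Length(Actual) < Common then
    Common := Length(Actual);
  for I := 0 to Common - 1 do
    if Expected[I] <> Actual[I] then
    begin
      Result := I;
      Exit;
    end;
  if Length(Expected) <> Length(Actual) then
    Result := Common   { one list is a prefix of the other }
  else
    Result := -1;
end;

procedure Report(const Name: string; const Expected, Actual: array of Integer);
var
  Idx: Integer;
begin
  Idx := FirstMismatch(Expected, Actual);
  if Idx < 0 then
    WriteLn(Name, ': order matches')
  else
    WriteLn(Name, ': first mismatch at index ', Idx);
end;

begin
  { Expected order recorded from the /Kids array of the triggering document }
  Report('file-order loader', [20, 1, 4], [1, 4, 20]);
  Report('tree-order loader', [20, 1, 4], [20, 1, 4]);
end.
```

The first line reports a mismatch at index 0 and the second reports a match. Comparing object numbers rather than rendered output keeps the check fast and independent of content streams.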

The HotPDF Component page covers the full API for page operations, including CopyPageFromDocument, InsertPagesFromDocument, and MovePage.