PDF Page Trees: Why Page Order Is Not Object Order

Page 1 of a PDF is not necessarily object 1. That distinction is one of the most common sources of wrong-page-extracted bugs in PDF parsers, and the fix is to follow the page tree that the spec defines rather than the order in which objects happen to appear in the file.

Objects, references, and the catalog

A PDF file is a collection of numbered indirect objects. Each one carries an object number and a generation number, written as N G obj, where G is almost always 0 in files that have not been incrementally updated. Objects reference each other with the notation N G R, so 3 0 R means "object 3, generation 0." The trailer's /Root entry points to the document catalog, whose /Pages entry leads to the page tree. Everything navigable in a PDF starts from that root, not from the first byte of the file body.
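As a rough illustration of the N G R notation, the sketch below pulls reference tokens out of a dictionary's raw bytes with a regular expression. The function name find_references is made up for this example, and the approach is deliberately naive: a real tokenizer would have to skip string literals, comments, and stream data, where the same byte pattern can appear without being a reference.

```python
import re

# Matches an indirect reference such as b"3 0 R":
# object number, generation number, then the keyword R.
REF_PATTERN = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def find_references(data: bytes) -> list[tuple[int, int]]:
    """Return (object_number, generation) for each N G R token in data.

    Naive by design: does not skip strings, comments, or streams.
    """
    return [(int(num), int(gen)) for num, gen in REF_PATTERN.findall(data)]

catalog = b"<< /Type /Catalog /Pages 16 0 R >>"
print(find_references(catalog))  # [(16, 0)]
```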

The cross-reference table (or cross-reference stream in PDF 1.5+) maps object numbers to file offsets. Its job is random access, not ordering. A writer that builds a document incrementally can append new objects at the end with higher numbers while those objects logically precede existing ones in the page sequence. That is not a defect; it is by design.

The page tree (ISO 32000-1 §7.7.3)

Page sequence lives in the page tree. The root catalog contains a /Pages reference that points to a node of type /Pages. That node's /Kids array lists its children in reading order. Each child is either a leaf node of type /Page or another intermediate /Pages node containing its own /Kids. Page 1 is the first leaf reached by a depth-first, left-to-right traversal of the Kids arrays. The /Count entry on each intermediate node caches the total number of descendant leaf pages, so a viewer can jump to page 500 without walking the entire tree.

Here is what a minimal three-page tree looks like in raw PDF syntax:

16 0 obj
<<
  /Type /Pages
  /Count 3
  /Kids [20 0 R  1 0 R  4 0 R]
  /MediaBox [0 0 612 792]
>>
endobj

20 0 obj
<< /Type /Page  /Parent 16 0 R  /Contents 21 0 R  /Resources 22 0 R >>
endobj

1 0 obj
<< /Type /Page  /Parent 16 0 R  /Contents 2 0 R   /Resources 3 0 R >>
endobj

4 0 obj
<< /Type /Page  /Parent 16 0 R  /Contents 5 0 R   /Resources 6 0 R >>
endobj

The Kids array reads [20 0 R, 1 0 R, 4 0 R]. Logical page 1 is object 20, logical page 2 is object 1, and logical page 3 is object 4. Any code that iterates object numbers from 1 upward encounters them in the order 1, 4, 20 and produces the sequence page 2, page 3, page 1. The same file looks perfectly normal in a viewer that follows the tree and comes out shuffled from any tool that does not.
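The difference between the two orderings can be sketched in a few lines of Python. The objects dictionary is a hypothetical in-memory model of the three-page file above (object number mapped to its entries, with references stored as plain object numbers and generation 0 assumed); it is not the output of any particular PDF library, and the walk assumes a well-formed tree with no cycles.

```python
# Hypothetical in-memory model of the three-page example:
# object number -> dictionary entries, references stored as object numbers.
objects = {
    16: {"Type": "Pages", "Count": 3, "Kids": [20, 1, 4]},
    20: {"Type": "Page", "Parent": 16},
    1: {"Type": "Page", "Parent": 16},
    4: {"Type": "Page", "Parent": 16},
}

def pages_in_order(objects, node_num):
    """Depth-first, left-to-right walk of /Kids; returns leaf object numbers."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return [node_num]
    ordered = []
    for kid in node["Kids"]:
        ordered.extend(pages_in_order(objects, kid))
    return ordered

# Correct: follow the tree from the root /Pages node.
tree_order = pages_in_order(objects, 16)

# Buggy: collect every /Page object and sort by object number.
scan_order = sorted(num for num, obj in objects.items() if obj["Type"] == "Page")

print(tree_order)  # [20, 1, 4]
print(scan_order)  # [1, 4, 20]
```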

Inheritance

Intermediate nodes can carry properties that their descendants inherit. The inheritable entries are /Resources (fonts and images), /MediaBox (page dimensions), /CropBox, and /Rotate. A leaf page that omits /MediaBox is not broken; it picks up the value from the nearest ancestor node that defines it. A page that does define /MediaBox overrides whatever its ancestors say, for that page only.

This matters for parsing. Reading a /Page object in isolation and assuming its properties are complete will misreport dimensions for any page that relies on inheritance. A correct reader walks the /Parent chain, collecting properties it has not yet seen, stopping at the root.
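A minimal sketch of that parent-chain walk follows, again over a hypothetical dictionary model rather than a real parser's object types. The function name resolve_inherited and the sample values are assumptions made for illustration; the seen set is a defensive addition so that a malformed file with a cyclic /Parent chain cannot loop forever.

```python
# Entries that a page may inherit from an ancestor /Pages node.
INHERITABLE = ("Resources", "MediaBox", "CropBox", "Rotate")

# Hypothetical model: page 20 relies on inheritance, page 1 overrides MediaBox.
objects = {
    16: {"Type": "Pages", "Kids": [20, 1], "MediaBox": [0, 0, 612, 792], "Rotate": 0},
    20: {"Type": "Page", "Parent": 16},
    1: {"Type": "Page", "Parent": 16, "MediaBox": [0, 0, 595, 842]},
}

def resolve_inherited(objects, page_num):
    """Walk /Parent links upward; the nearest definition of each entry wins."""
    resolved = {}
    seen = set()  # guards against a cyclic /Parent chain in a damaged file
    current = page_num
    while current is not None and current not in seen:
        seen.add(current)
        node = objects[current]
        for key in INHERITABLE:
            if key in node and key not in resolved:
                resolved[key] = node[key]
        current = node.get("Parent")  # the root node has no /Parent
    return resolved

print(resolve_inherited(objects, 20)["MediaBox"])  # [0, 0, 612, 792]
print(resolve_inherited(objects, 1)["MediaBox"])   # [0, 0, 595, 842]
```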

Nested trees

Nothing in the spec limits the tree to a single level. A large document might group pages under intermediate nodes that loosely correspond to chapters:

2 0 obj   % root Pages node, Count = 8
<< /Type /Pages  /Count 8  /Kids [3 0 R  4 0 R] >>
endobj

3 0 obj   % first chapter, 5 pages
<< /Type /Pages  /Parent 2 0 R  /Count 5
   /Kids [10 0 R  11 0 R  12 0 R  13 0 R  14 0 R]
   /MediaBox [0 0 612 792] >>
endobj

4 0 obj   % second chapter, 3 pages
<< /Type /Pages  /Parent 2 0 R  /Count 3
   /Kids [20 0 R  21 0 R  22 0 R]
   /MediaBox [0 0 612 792] >>
endobj

The traversal algorithm is the same: visit Kids in order, recurse into any /Pages node, collect leaf /Page nodes. The /Count values allow a viewer to skip an entire subtree when jumping to a page that lies beyond it, which is why those counts must be accurate. Some PDF editors from the late 1990s and early 2000s did not recompute them after in-place edits, so a defensive parser verifies /Count against the actual leaf count rather than trusting it for array allocation.
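The sketch below shows both halves of that paragraph on the eight-page example: a lookup that uses /Count to skip whole subtrees, and a leaf count that a defensive parser could compare against the stored value before trusting it. The dictionary model and the names count_leaves and find_page are illustrative assumptions; find_page itself trusts /Count, so it presumes the check has already passed.

```python
# Hypothetical model of the nested example: two chapter nodes under one root.
objects = {
    2: {"Type": "Pages", "Count": 8, "Kids": [3, 4]},
    3: {"Type": "Pages", "Parent": 2, "Count": 5, "Kids": [10, 11, 12, 13, 14]},
    4: {"Type": "Pages", "Parent": 2, "Count": 3, "Kids": [20, 21, 22]},
}
for leaf in (10, 11, 12, 13, 14, 20, 21, 22):
    objects[leaf] = {"Type": "Page"}

def count_leaves(objects, node_num):
    """Count leaf pages by walking the tree, ignoring the stored /Count."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return 1
    return sum(count_leaves(objects, kid) for kid in node["Kids"])

def find_page(objects, root_num, index):
    """Return the object number of the zero-based page index.

    Uses /Count to skip subtrees that end before the requested page.
    """
    if index < 0:
        raise IndexError("page index out of range")
    node_num = root_num
    while objects[node_num]["Type"] == "Pages":
        for kid in objects[node_num]["Kids"]:
            kid_node = objects[kid]
            span = kid_node["Count"] if kid_node["Type"] == "Pages" else 1
            if index < span:
                node_num = kid  # the target lies inside this child
                break
            index -= span  # skip the whole child without visiting it
        else:
            raise IndexError("page index out of range")
    return node_num

# Verify the cached count before relying on it.
assert count_leaves(objects, 2) == objects[2]["Count"]

print(find_page(objects, 2, 0))  # 10, logical page 1
print(find_page(objects, 2, 6))  # 21, logical page 7
```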

Where this comes up in practice

The page-order bug surfaces most often in two scenarios. The first is a custom parser that scans for objects of type /Page rather than following the tree. It finds every page, but in object-number order, not reading order. The fix is always the same: start from the trailer, resolve the root catalog, follow /Pages, and traverse Kids arrays.

The second scenario is an incremental-update file. When a PDF editor appends changes without rewriting the whole file, new page objects get high object numbers, while the Kids array of the current page tree still controls their logical position. Suppose a page that was originally object 5 is replaced by a new object 143. The update also appends a new version of the parent /Pages node whose Kids array references 143 where it used to reference 5, so logical order is preserved. Walking by object number would place the replacement page after every lower-numbered page, regardless of where it belongs.
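The replacement scenario can be modeled as a list of revisions in which later definitions of an object number win. This is a simplified sketch: the revisions structure, the object numbers other than 5 and 143, and the function current_objects are invented for illustration, and it assumes the editor left the old object 5 in place rather than marking it free. A real reader gets the same effect by following the trailer's /Prev chain from the newest cross-reference section backward and keeping the first entry it sees for each object number.

```python
# Hypothetical revisions, oldest first. Each maps object number -> dictionary.
revisions = [
    {   # original file: three pages, objects 4, 5, 6
        2: {"Type": "Pages", "Count": 3, "Kids": [4, 5, 6]},
        4: {"Type": "Page", "Parent": 2},
        5: {"Type": "Page", "Parent": 2},
        6: {"Type": "Page", "Parent": 2},
    },
    {   # appended update: new page 143 and a rewritten /Pages node
        2: {"Type": "Pages", "Count": 3, "Kids": [4, 143, 6]},
        143: {"Type": "Page", "Parent": 2},
    },
]

def current_objects(revisions):
    """Merge revisions so the newest definition of each object number wins."""
    merged = {}
    for revision in revisions:
        merged.update(revision)
    return merged

objects = current_objects(revisions)

# Correct: the current Kids array gives the logical order.
tree_order = objects[2]["Kids"]

# Buggy: scanning for /Page objects also picks up the stale object 5.
scan_order = sorted(num for num, obj in objects.items() if obj["Type"] == "Page")

print(tree_order)  # [4, 143, 6]
print(scan_order)  # [4, 5, 6, 143]
```

In this model the scan is wrong twice over: the replacement page lands last, and the superseded page is still reported.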

Linearized (web-optimized) PDFs add a third variation: the file is physically rearranged so that the first page's content appears near the start of the file for fast display over a slow connection. The page tree remains authoritative for order, and the cross-reference table maps object numbers to the rearranged offsets. A parser that infers structure from physical position, rather than resolving every reference through the xref table and the page tree, can misread even the first page of a linearized file.

The HotPDF Component handles page tree traversal, inheritance resolution, and incremental-update xref merging internally. Working with its page objects directly means the Kids-array ordering is already applied; page indices map to logical pages, not object numbers.