Technical Article

PDF File Structure: Header, Body, Xref, and Trailer

A PDF reader does not start at the beginning of the file. It starts at the end. The last few bytes hold the address of everything else, and a parser that does not understand that order will misread the format from the first line. So the most useful way to learn PDF on disk is to learn it the way a reader does: tail first, then jump backward to the map, then resolve the objects the map points at.

The bytes themselves are plain enough to read in a text editor when nothing is compressed. A minimal one-page document that draws "Hello, World!" fits in well under a kilobyte, and every structural element of the format is visible in it. Here is the whole file, with the four parts marked. The % annotations on the right and the blank lines between objects are added for readability and are not counted in the byte offsets:

%PDF-1.0                          % Header
%âãÏÓ

1 0 obj                           % Body: the object sequence
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj

2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
/Type /Page
>>
endobj

3 0 obj
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
endobj

4 0 obj
<< /Length 65 >>
stream
1. 0. 0. 1. 50. 700. cm BT
  /F0 36. Tf
  (Hello, World!) Tj
ET
endstream
endobj

5 0 obj
<< /Pages 1 0 R /Type /Catalog >>
endobj

xref                              % Cross-reference table
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n

trailer                           % Trailer
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF

Four parts, always in this order down the file: a header, a body of objects, a cross-reference table, and a trailer. The catch is that you read them in almost the reverse order. ISO 32000-2 §7.5.1 lays out the same four-part anatomy, and the reason for the back-to-front access is purely practical: a reader that jumps straight to the object it needs is much faster than one that scans every byte from the top, and that random access is exactly what the trailer and cross-reference table exist to provide.

The header is two lines, and the second one matters

The first line is %PDF-1.0. The percent sign makes it a comment as far as the syntax is concerned, but readers treat it as the file signature and pull the version number out of it. Version handling is loose in practice. A reader built for PDF 2.0 will happily open a file that claims 1.0, and most readers will attempt a file whose declared version is wrong or whose version line is buried a little way into the file rather than at byte zero. The number is a hint about which features to expect, not a gate.

The second line is the one people delete by accident and then spend an afternoon debugging. It is a comment too, but its payload is four bytes with values above 127, outside the ASCII range. They exist so that anything moving the file in "text mode" recognizes it as binary and stops rewriting line endings. A PDF carries compressed streams whose bytes can match a carriage return or line feed by coincidence; if a transfer tool rewrites those, the stream length recorded in the dictionary no longer matches the bytes on disk and the file is corrupt. The high-byte comment is a decades-old defense against text-mode transfers such as FTP in ASCII mode, and it is still in every file a serious tool writes because the failure it prevents is silent and total.
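
Both header lines can be checked in a few lines of code. The sketch below is a minimal illustration in Python, not a reference implementation: the function name read_header is hypothetical, and the 1024-byte search window is an assumed tolerance for a signature that does not sit at byte zero.

```python
import re

# Signature pattern: "%PDF-" followed by a one-digit major and minor version.
SIGNATURE_RE = re.compile(rb"%PDF-(\d)\.(\d)")

def read_header(data: bytes, search_window: int = 1024):
    """Return (major, minor, has_binary_marker), or None if no signature is found.

    search_window is an assumption for illustration: it lets the signature
    sit a little way into the file instead of exactly at byte zero.
    """
    match = SIGNATURE_RE.search(data[:search_window])
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    # Look at the line right after the signature line.
    lines = data[match.start():match.start() + search_window].splitlines()
    second = lines[1] if len(lines) > 1 else b""
    # The marker is a comment carrying at least four bytes above 127.
    has_marker = second.startswith(b"%") and sum(b >= 128 for b in second) >= 4
    return major, minor, has_marker
```

A result whose third field is False is the case worth flagging: the file may still open, but it has lost its protection against text-mode transfers.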

The body holds the objects, each one numbered

Everything that makes up the document lives in the body as a flat sequence of indirect objects. Each one opens with two integers and the obj keyword, holds its content, and closes with endobj. Object 1 in the sample above is the page-tree node: 1 0 obj, then a dictionary, then endobj. The first integer is the object number, the second is the generation number. Generation is almost always zero in a freshly written file; it climbs only when an object number gets reused across edits, which is rare enough that you can treat a nonzero generation as a sign the file has been through incremental updates. The content between the keywords is a dictionary here, written between << and >>, but it could just as well be a number, a string, an array, or a stream.

What makes this a graph rather than a list is the reference token 2 0 R. That means "object 2, generation 0, wherever it happens to live in the file." The page-tree node above does not contain its page; it points at object 2, which points at its resources and content stream by the same mechanism. The body is laid out in whatever order the writer found convenient, and the references stitch it into a tree rooted at the catalog. Position in the file carries no meaning. Identity comes from the object number, and location comes from the cross-reference table.
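
To make the graph concrete, the following sketch pulls object headers and reference tokens out of an uncompressed file with two regular expressions. It is a rough illustration only: object_graph is a hypothetical name, and plain pattern matching will be fooled by streams or strings that happen to contain text shaped like an object header or a reference.

```python
import re

# "N G obj" opens an indirect object; "N G R" is a reference to one.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def object_graph(data: bytes):
    """Map each (number, generation) to the set of objects it references."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # The object's content runs up to the next endobj keyword.
        end = data.find(b"endobj", match.end())
        body = data[match.end():end if end != -1 else len(data)]
        graph[key] = {(int(n), int(g)) for n, g in REF_RE.findall(body)}
    return graph
```

Run against the sample file, the entry for object 2 would list objects 1, 3, and 4, which is the page pointing at its parent, its resources, and its content stream.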

The cross-reference table is an index of byte offsets

The xref table is what turns object numbers into file positions. It is the reason a reader can open a thousand-page document and render page 850 without parsing the 849 pages before it. Each entry records exactly where its object begins, counted in bytes from the start of the file:

xref
0 6                  % 6 entries, starting at object 0
0000000000 65535 f   % entry 0: head of the free list
0000000015 00000 n   % object 1 begins at byte 15
0000000074 00000 n   % object 2 begins at byte 74
0000000192 00000 n   % object 3 begins at byte 192
0000000291 00000 n   % object 4 begins at byte 291
0000000409 00000 n   % object 5 begins at byte 409

The fixed width is deliberate. Every entry is exactly twenty bytes: a ten-digit offset, a space, a five-digit generation, a space, a one-character type, and a two-byte end-of-line. Because the rows are uniform, a reader can index straight to the entry for object n by arithmetic rather than scanning, so the table that gives random access to the body is itself randomly accessible. The 0 6 line is a subsection header: it says the next entries describe six objects starting at number 0.
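
The arithmetic is short enough to show in full. This sketch assumes the caller has already sliced out the entries of one subsection, without its header line; parse_entry and entry_for are hypothetical names chosen for the example.

```python
ENTRY_SIZE = 20  # 10-digit offset, space, 5-digit generation, space, type, 2-byte EOL

def parse_entry(entry: bytes):
    """Decode one 20-byte entry into (offset, generation, in_use)."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError("xref entries are exactly 20 bytes")
    offset = int(entry[0:10])
    generation = int(entry[11:16])
    kind = entry[17:18]
    if kind not in (b"n", b"f"):
        raise ValueError("entry type must be n or f")
    return offset, generation, kind == b"n"

def entry_for(table: bytes, first: int, count: int, number: int):
    """Fetch the entry for object `number` from a subsection by arithmetic.

    `table` holds only the entries of a subsection whose header line
    was `first count`.
    """
    if not first <= number < first + count:
        raise KeyError(number)
    start = (number - first) * ENTRY_SIZE
    return parse_entry(table[start:start + ENTRY_SIZE])
```

No entry before the requested one is ever read, which is the whole point of the fixed width.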

Object 0 is special and always present. Its type is f for free, its generation is 65535, and it heads the linked list of free object numbers. In a file that has never been edited the free list is just this one entry, a formality. It earns its keep during incremental updates, when deleting an object adds its number to that list so a later edit can reclaim it. The other entries are type n for in-use, and their ten-digit number is the offset you would seek to in order to read that object's definition.

The trailer is the entry point, and it sits at the end

The trailer is the first thing a reader actually consumes, even though it is written last. A parser opens the file, seeks to the end, and walks backward looking for %%EOF. Just above it sits startxref followed by a single number, and that number is the byte offset of the xref keyword. With it the reader jumps directly to the cross-reference table without having scanned a single object:

trailer
<<
/Root 5 0 R          % the document catalog
/Size 6              % one more than the highest object number
>>
startxref
459                  % byte offset of the xref table
%%EOF

The trailer dictionary carries the two values a reader needs before it can do anything else. /Root points at the document catalog, object 5 here, which is the top of the object graph and the route to the page tree. /Size is the count of entries the cross-reference table should contain, which is one more than the highest object number because of the free entry at slot zero. From %%EOF the whole reading sequence falls out: find the marker, read startxref to locate the table, load the table to learn where every object lives, read /Root to find the catalog, and resolve objects on demand from there. The header, sitting at the top, supplies only the signature and the version hint. The map at the bottom is what the reader needs first.
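
The first two steps of that sequence can be sketched as follows. This is a simplified illustration for a classic xref table: find_startxref and read_trailer are hypothetical names, the 1024-byte tail window is an assumption, and the trailer is read with regular expressions rather than a real dictionary parser.

```python
import re

def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the offset stored after the last startxref keyword.

    tail_size is an assumed window: only the end of the file is searched.
    """
    tail = data[-tail_size:]
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("no %%EOF marker near the end of the file")
    marker = tail.rfind(b"startxref", 0, eof)
    if marker == -1:
        raise ValueError("no startxref keyword before %%EOF")
    match = re.match(rb"\s*(\d+)", tail[marker + len(b"startxref"):eof])
    if match is None:
        raise ValueError("startxref is not followed by a number")
    return int(match.group(1))

def read_trailer(data: bytes, xref_offset: int):
    """Return (/Root object number, /Size) from the trailer after the table."""
    start = data.find(b"trailer", xref_offset)
    end = data.find(b"startxref", start)
    if start == -1 or end == -1:
        raise ValueError("no trailer after the xref table")
    dictionary = data[start:end]
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", dictionary)
    size = re.search(rb"/Size\s+(\d+)", dictionary)
    if root is None or size is None:
        raise ValueError("trailer lacks /Root or /Size")
    return int(root.group(1)), int(size.group(1))
```

Neither function touches the body of the file, which mirrors how a reader gets from the last bytes to the catalog without scanning any objects.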

Incremental update appends a second map instead of rewriting

That tail-first design pays off when a file changes. A PDF can be edited without rewriting any of the bytes already on disk. New and modified objects are appended to the end, followed by a fresh cross-reference section and a fresh trailer, and the original file underneath is left untouched. The one new piece of bookkeeping is a /Prev entry in the new trailer, holding the byte offset of the previous cross-reference table:

% ... original file, unchanged, ends here ...

6 0 obj                          % an object added by this edit
<< /Type /Annot /Subtype /Text /Rect [100 700 120 720] >>
endobj

xref                             % a second xref section, for the new object only
6 1
0000000612 00000 n

trailer
<<
/Root 5 0 R
/Size 7
/Prev 459                        % byte offset of the earlier xref table
>>
startxref
680                              % offset of this new xref section
%%EOF
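
Writing such a revision is mostly bookkeeping. The sketch below appends objects, a cross-reference section, and a trailer to existing bytes without touching them. It is an illustration under stated assumptions: append_revision is a hypothetical helper, every new object gets generation 0, and the caller supplies the previous xref offset, the current /Size, and the catalog's object number.

```python
def append_revision(data: bytes, new_objects: dict, root: int,
                    prev_xref: int, size: int) -> bytes:
    """Append objects {number: body bytes} as an incremental update."""
    out = bytearray(data)              # the original bytes are kept as-is
    if not out.endswith(b"\n"):
        out += b"\n"
    offsets = {}
    for number in sorted(new_objects):
        offsets[number] = len(out)     # byte offset where this object starts
        out += b"%d 0 obj\n" % number + new_objects[number] + b"\nendobj\n"
    xref_at = len(out)
    out += b"xref\n"
    for number in sorted(offsets):
        # One single-entry subsection per object keeps the sketch simple.
        out += b"%d 1\n" % number
        out += b"%010d %05d n \n" % (offsets[number], 0)
    new_size = max(size, max(offsets) + 1)
    out += b"trailer\n<< /Root %d 0 R /Size %d /Prev %d >>\n" % (
        root, new_size, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_at
    return bytes(out)
```

Because every offset is taken from the length of the output at the moment an object is written, the table cannot disagree with the bytes it describes.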

A reader still starts at the final %%EOF, still follows startxref to the most recent table, but now follows the /Prev chain backward to older tables, merging them so that the newest entry for any object number wins. The cross-reference sections form a linked list down through the file, each one overriding the one before it for the objects it touches. An object that an edit replaced still physically exists at its old offset; it is simply no longer reachable, because a later xref entry points somewhere newer.
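
The merge rule, newest entry wins, is easy to get backward, so here is one way it might look in code. The sketch handles classic xref tables only and reads /Prev with a regular expression; parse_xref_section and merged_xref are hypothetical names.

```python
import re

SUBSECTION_RE = re.compile(rb"\s*(\d+) (\d+)\s*")
ENTRY_RE = re.compile(rb"\s*(\d{10}) (\d{5}) ([nf])")

def parse_xref_section(data: bytes, offset: int):
    """Return ({number: (offset, generation, in_use)}, prev_offset or None)."""
    if not data.startswith(b"xref", offset):
        raise ValueError("offset does not point at an xref keyword")
    pos = offset + 4
    entries = {}
    while True:
        header = SUBSECTION_RE.match(data, pos)
        if header is None:             # reached the trailer keyword
            break
        first, count = int(header.group(1)), int(header.group(2))
        pos = header.end()
        for number in range(first, first + count):
            entry = ENTRY_RE.match(data, pos)
            if entry is None:
                raise ValueError("malformed xref entry")
            entries[number] = (int(entry.group(1)), int(entry.group(2)),
                               entry.group(3) == b"n")
            pos = entry.end()
    trailer_end = data.find(b"startxref", pos)
    prev = re.search(rb"/Prev\s+(\d+)", data[pos:trailer_end])
    return entries, int(prev.group(1)) if prev else None

def merged_xref(data: bytes, startxref: int):
    """Follow the /Prev chain from the newest section; newest entries win."""
    table, seen, offset = {}, set(), startxref
    while offset is not None and offset not in seen:
        seen.add(offset)               # guards against a /Prev loop
        entries, prev = parse_xref_section(data, offset)
        for number, entry in entries.items():
            # Sections are visited newest first, so keep the first one seen.
            table.setdefault(number, entry)
        offset = prev
    return table
```

Visiting the sections newest first and never overwriting an existing key is what implements the override: an older section can only fill in object numbers the newer ones did not mention.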

This is the mechanism that lets a signed PDF stay verifiable after later edits. A digital signature covers a byte range of the file, and because an incremental update only ever appends, the signed bytes never move. The signature still validates against the original range while later revisions sit beyond it, each with its own xref and trailer. It is also why a PDF can carry recoverable history: every superseded object is still on disk under an earlier cross-reference section, which is a feature for version tracking and a liability for anyone who thought "delete" meant the bytes were gone.
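
Recovering that history takes very little code, because each revision ends at its own %%EOF. The sketch below cuts a file at every marker; revision_ends and revisions are hypothetical names, and the approach assumes %%EOF never appears inside a stream or string, which a careful tool would have to check.

```python
EOF_MARKER = b"%%EOF"

def revision_ends(data: bytes):
    """Return the offset just past each %%EOF marker, oldest first."""
    ends = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        end = pos + len(EOF_MARKER)
        # Include the end-of-line that follows the marker, if present.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        ends.append(end)
        pos = data.find(EOF_MARKER, end)
    return ends

def revisions(data: bytes):
    """Return each earlier state of the file as its own byte string."""
    return [data[:end] for end in revision_ends(data)]
```

Every element of the result is a prefix of the file, and each one is the document as it stood when that revision was saved.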

The cost is growth. Each edit appends; nothing is reclaimed in place, so a file revised many times accumulates dead objects and a long chain of xref sections. The remedy is a full rewrite: load the document and save it fresh, which renumbers the surviving objects, drops the unreachable ones, and emits a single clean cross-reference table. The two strategies trade against each other directly. Appending is fast and preserves signatures and history; rewriting is slower and discards both, in exchange for a compact file.

Reading the four parts in practice

Knowing the layout is enough to debug most "this file won't open" problems by hand. If a reader rejects a PDF, the usual culprits are at the two ends, not in the middle. A truncated download loses the trailer, so startxref or %%EOF is missing and the reader has no entry point; tolerant readers fall back to scanning the whole file to rebuild the xref, which is exactly the slow path the table was meant to avoid. A botched text-mode transfer corrupts stream bytes or shifts everything after a rewritten line ending, so the offsets stop matching reality and objects load from the wrong position. When the offsets in the table no longer point at real obj keywords, the file is structurally broken even if every object is individually fine.
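
Both checks, the offset audit and the slow-path rebuild, can be sketched with one regular expression. This is an illustration of the idea rather than how any particular reader does it: rebuild_xref and stale_entries are hypothetical names, and the scan can be misled by object-like text inside streams.

```python
import re

OBJ_RE = re.compile(rb"(\d+)[ \t\r\n]+(\d+)[ \t\r\n]+obj\b")

def rebuild_xref(data: bytes):
    """Scan the whole file and return {number: (offset, generation)}.

    Later definitions overwrite earlier ones, so the newest copy wins.
    """
    table = {}
    for match in OBJ_RE.finditer(data):
        table[int(match.group(1))] = (match.start(), int(match.group(2)))
    return table

def stale_entries(data: bytes, table: dict):
    """Return object numbers whose recorded offset is not an `N G obj` header."""
    bad = []
    for number, (offset, generation) in sorted(table.items()):
        match = OBJ_RE.match(data, offset)
        if (match is None or int(match.group(1)) != number
                or int(match.group(2)) != generation):
            bad.append(number)
    return bad
```

An empty list from stale_entries means the table and the bytes agree; anything else names the objects a reader would load from the wrong position.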

For new code, the lesson of the layout is to let a library own the byte bookkeeping. The offsets in the cross-reference table have to agree with the actual positions of every object to the byte, the trailer has to point at the right table, and incremental updates have to chain correctly through /Prev. A native component like the HotPDF Component for Delphi and C++Builder handles all of that when it writes a file, including the choice between appending an incremental revision and rewriting a compact one. If you want to see the same structure built up from nothing rather than dissected, the companion piece on building a PDF document from scratch walks through emitting the header, objects, xref, and trailer in order.