Validating Compressed PDFs: Object and XRef Streams

You write a small validator. It opens a PDF, seeks to the end, finds startxref, reads the offset, and expects to land on the keyword xref with a fixed-width cross-reference table beneath it. From that table it collects object offsets, then scans backward for the trailer keyword to learn the /Root and /Size. It works perfectly on every file you generated to test it. Then a file produced by a current version of Word, or by a library that targets PDF 1.5, arrives, and the validator declares it broken. There is no xref keyword where the offset points, no trailer dictionary anywhere, and the object table the validator built is nearly empty. The file is valid. The validator is reading it through a lens older than PDF 1.5 itself.

This is the single most common reason a byte-level PDF check written against the classic layout fails on modern documents. The structure it depends on, the plaintext cross-reference table and the trailer keyword, was made optional in PDF 1.5 and is frequently absent. Two features replaced it: the cross-reference stream and the compressed object stream. Both are described in ISO 32000-1, and a validator that does not know about them sees a healthy file as a heap of missing objects.

What PDF 1.5 changed about the file tail

ISO 32000-1 §7.5.8 defines the cross-reference stream, and §7.5.7 defines the object stream of type /ObjStm. Together they let a writer drop the two structures a classic parser keys on. A PDF 1.5 file may end with no xref table at all. In its place, the object that startxref points to is an ordinary stream object whose dictionary carries /Type /XRef, and that stream holds the cross-reference data in a compact binary form. There is no trailer keyword either, because the trailer is now the stream's own dictionary. The keys a classic parser hunted for, /Root, /Size and /ID, live inside that dictionary.
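The branch a validator needs at the file tail can be sketched as follows. This is a minimal illustration in Free Pascal, assuming the whole file is already in memory as an AnsiString; the names LastPos, ReadStartXRef, ClassifyTail and TTailKind are hypothetical helpers invented for this sketch, and a production parser would confine the startxref search to the last kilobyte or so of the file.

```pascal
program TailKind;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TTailKind = (tkClassicTable, tkXRefStream, tkUnknown);

const
  PdfWhite = [#0, #9, #10, #12, #13, ' '];

// Position of the last occurrence of Needle in Hay, or 0 if absent.
function LastPos(const Needle, Hay: AnsiString): Integer;
var
  I: Integer;
begin
  for I := Length(Hay) - Length(Needle) + 1 downto 1 do
    if Copy(Hay, I, Length(Needle)) = Needle then
      Exit(I);
  Result := 0;
end;

// Reads the decimal offset that follows the last 'startxref', or -1.
function ReadStartXRef(const Data: AnsiString): Int64;
var
  P: Integer;
  Digits: AnsiString;
begin
  Result := -1;
  P := LastPos('startxref', Data);
  if P = 0 then
    Exit;
  Inc(P, Length('startxref'));
  while (P <= Length(Data)) and (Data[P] in PdfWhite) do
    Inc(P);
  Digits := '';
  while (P <= Length(Data)) and (Data[P] in ['0'..'9']) do
  begin
    Digits := Digits + Data[P];
    Inc(P);
  end;
  if Digits <> '' then
    Result := StrToInt64(Digits);
end;

// Classifies what sits at a 0-based file offset.
function ClassifyTail(const Data: AnsiString; Offset: Int64): TTailKind;
var
  P, StreamAt: Integer;
  Rest: AnsiString;
begin
  Result := tkUnknown;
  if (Offset < 0) or (Offset >= Length(Data)) then
    Exit;
  P := Integer(Offset) + 1;               // string indexes are 1-based
  if Copy(Data, P, 4) = 'xref' then
    Exit(tkClassicTable);
  // Simplification: accept any object whose dictionary mentions /XRef
  // before the 'stream' keyword; the "N G obj" header is not verified.
  Rest := Copy(Data, P, Length(Data) - P + 1);
  StreamAt := Pos('stream', Rest);
  if (StreamAt > 0) and (Pos('/XRef', Copy(Rest, 1, StreamAt)) > 0) then
    Result := tkXRefStream;
end;

var
  Classic, Modern: AnsiString;
begin
  // Two tiny hand-made tails; offset 9 is the byte after '%PDF-1.x' + LF.
  Classic := '%PDF-1.4'#10'xref'#10'0 1'#10'0000000000 65535 f '#10 +
             'trailer'#10'<< /Size 1 >>'#10'startxref'#10'9'#10'%%EOF';
  Modern  := '%PDF-1.5'#10'7 0 obj'#10 +
             '<< /Type /XRef /Size 8 /W [1 2 1] /Length 0 >>'#10 +
             'stream'#10'endstream'#10'endobj'#10'startxref'#10'9'#10'%%EOF';
  Writeln(ClassifyTail(Classic, ReadStartXRef(Classic)) = tkClassicTable);
  Writeln(ClassifyTail(Modern, ReadStartXRef(Modern)) = tkXRefStream);
end.
```

Returning a three-way result rather than a Boolean keeps the "neither" case visible, which is the case a validator should report as genuinely damaged.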

The second change moves the objects themselves. Instead of writing every indirect object at its own byte offset, a writer can pack many small objects, the page dictionaries, the annotation dictionaries, the structure tree, into a single object stream and compress the whole container with Flate. The individual objects no longer have a byte offset in the file. They have a position inside a compressed blob. A validator scanning the raw bytes for 1 0 obj never finds them: inside an object stream the objects are stored without the obj and endobj keywords, and even their bodies are unreadable until the container is inflated. To a classic parser, half the document has simply vanished.

The trailer keys are plaintext, even in a compressed file

The reassuring part is that reading the trailer of a cross-reference stream does not require inflating anything. A stream object is written as a dictionary followed by the stream keyword and then the compressed bytes. The dictionary is plaintext. So when startxref points at a cross-reference stream, the bytes immediately after the object header (object number, generation number and the obj keyword) look like an ordinary dictionary, and /Root, /Size and /ID sit there in the clear, before the stream keyword and the Flate data begin.

That means a validator can learn the three facts it most needs, where the catalog is, how many objects the file claims, and the file identifier, by parsing only the stream dictionary. It does not have to decompress the cross-reference data, and it does not have to interpret the binary entries inside it. The work that defeats a naive parser is not reading the trailer; it is finding the objects. Those are two separable problems, and solving the first is cheap.
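Reading those keys from the plaintext dictionary might look like the sketch below. It is Free Pascal, and ExtractDict and DictValue are hypothetical helpers written for this illustration, not part of any library. It assumes the three values are a reference, an integer and an array, and it does not handle literal strings in parentheses or a nested dictionary that reuses the same key names.

```pascal
program TrailerKeys;
{$mode objfpc}{$H+}

uses
  SysUtils, StrUtils;

const
  PdfWhite = [#0, #9, #10, #12, #13, ' '];
  KeyEnd   = [#0, #9, #10, #12, #13, ' ', '/', '[', '<', '(', '>'];

// Returns the first balanced << ... >> dictionary at or after StartAt.
function ExtractDict(const S: AnsiString; StartAt: Integer): AnsiString;
var
  P, Start, Depth: Integer;
begin
  Result := '';
  Start := PosEx('<<', S, StartAt);
  if Start = 0 then
    Exit;
  P := Start;
  Depth := 0;
  while P < Length(S) do
  begin
    if (S[P] = '<') and (S[P + 1] = '<') then
    begin
      Inc(Depth);
      Inc(P, 2);
    end
    else if (S[P] = '>') and (S[P + 1] = '>') then
    begin
      Dec(Depth);
      Inc(P, 2);
      if Depth = 0 then
        Exit(Copy(S, Start, P - Start));
    end
    else if S[P] = '<' then
    begin
      // Hex string such as <0A1B>: skip to just past its closing '>'.
      while (P <= Length(S)) and (S[P] <> '>') do
        Inc(P);
      Inc(P);
    end
    else
      Inc(P);
  end;
end;

// Returns the raw text of the value that follows Key, or '' if absent.
function DictValue(const Dict, Key: AnsiString): AnsiString;
var
  P, Start, Depth: Integer;
begin
  Result := '';
  P := Pos(Key, Dict);
  // Skip matches that are only a prefix of a longer name.
  while (P > 0) and (P + Length(Key) <= Length(Dict))
        and not (Dict[P + Length(Key)] in KeyEnd) do
    P := PosEx(Key, Dict, P + 1);
  if P = 0 then
    Exit;
  Inc(P, Length(Key));
  while (P <= Length(Dict)) and (Dict[P] in PdfWhite) do
    Inc(P);
  Start := P;
  Depth := 0;                       // tracks [ ] nesting only
  while P <= Length(Dict) do
  begin
    if Dict[P] = '[' then
      Inc(Depth)
    else if Dict[P] = ']' then
      Dec(Depth)
    else if Depth = 0 then
      if ((Dict[P] = '/') and (P > Start)) or (Copy(Dict, P, 2) = '>>') then
        Break;                      // next key or end of dictionary
    Inc(P);
  end;
  Result := Trim(Copy(Dict, Start, P - Start));
end;

var
  Sample, Dict: AnsiString;
begin
  // Hand-made cross-reference stream object; the payload is a placeholder.
  Sample := '7 0 obj'#10 +
    '<< /Type /XRef /Size 8 /W [1 2 1] /Root 1 0 R' +
    ' /ID [<0A1B><0C2D>] /Filter /FlateDecode /Length 12 >>'#10 +
    'stream'#10'(compressed bytes omitted)'#10'endstream'#10'endobj';
  Dict := ExtractDict(Sample, 1);
  Writeln('Root = ', DictValue(Dict, '/Root'));
  Writeln('Size = ', DictValue(Dict, '/Size'));
  Writeln('ID   = ', DictValue(Dict, '/ID'));
end.
```

Nothing in this path touches the stream payload, which is the point: the trailer facts are available before any decoder is involved.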

Object streams: a header, then a Flate blob

An object stream is a container. Its dictionary carries /Type /ObjStm, an /N entry giving the number of objects packed inside, and a /First entry giving the byte offset, within the inflated data, where the first object's body starts. The compressed payload, once inflated, begins with a small header of /N integer pairs. Each pair is an object number and the offset of that object's body relative to /First. After the header come the object bodies themselves, concatenated.
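The layout just described can be walked with a few lines of string handling once the payload has been inflated. The sketch below is Free Pascal and works on an already inflated buffer; TPackedObject, NextInt and SplitObjectStream are hypothetical names for this illustration, and the code assumes the header offsets are in ascending order so that each body ends where the next one starts.

```pascal
program ObjStmHeader;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TPackedObject = record
    ObjNum: Integer;
    Body: AnsiString;
  end;
  TPackedObjects = array of TPackedObject;

// Reads the next unsigned integer at P, skipping leading white space.
// Raises EConvertError if no digits are found.
function NextInt(const S: AnsiString; var P: Integer): Integer;
var
  Start: Integer;
begin
  while (P <= Length(S)) and (S[P] in [#0, #9, #10, #12, #13, ' ']) do
    Inc(P);
  Start := P;
  while (P <= Length(S)) and (S[P] in ['0'..'9']) do
    Inc(P);
  Result := StrToInt(Copy(S, Start, P - Start));
end;

// Splits inflated object-stream data into (object number, body) records.
// N and First come from the stream dictionary's /N and /First entries.
function SplitObjectStream(const Inflated: AnsiString;
  N, First: Integer): TPackedObjects;
var
  P, I, BodyStart, BodyEnd: Integer;
  Offsets: array of Integer;
begin
  SetLength(Result, N);
  SetLength(Offsets, N);
  P := 1;
  for I := 0 to N - 1 do
  begin
    Result[I].ObjNum := NextInt(Inflated, P);   // first of the pair
    Offsets[I] := NextInt(Inflated, P);         // second, relative to First
  end;
  for I := 0 to N - 1 do
  begin
    BodyStart := First + Offsets[I] + 1;        // 0-based to 1-based
    if I < N - 1 then
      BodyEnd := First + Offsets[I + 1] + 1     // assumes ascending offsets
    else
      BodyEnd := Length(Inflated) + 1;
    Result[I].Body := Trim(Copy(Inflated, BodyStart, BodyEnd - BodyStart));
  end;
end;

var
  Header, Inflated: AnsiString;
  Objs: TPackedObjects;
  I: Integer;
begin
  // Two packed objects: number 11 at offset 0 and number 12 at offset 18.
  Header := '11 0 12 18 ';
  Inflated := Header + '<< /Type /Page >> ' + '<< /Type /Annot >>';
  Objs := SplitObjectStream(Inflated, 2, Length(Header));
  for I := 0 to High(Objs) do
    Writeln(Objs[I].ObjNum, ' 0 obj -> ', Objs[I].Body);
end.
```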

Expanding one is mechanical once the bytes are inflated. You read the dictionary to get /N and /First, inflate the stream with a Flate decoder, walk the leading /N pairs to learn which object number lives at which offset, and then lift each body out as if it were an ordinary indirect object with generation number 0. The only real dependency is the Flate decoder, and you already have one: Delphi ships System.ZLib, and Free Pascal ships the zstream unit, both of which inflate the zlib-format data held in a /FlateDecode stream without any third-party code. A routine that appends every extracted object to the validator's object table makes the rest of the validator, the part that walks /Root and checks the page tree, behave exactly as it would on a classic file.
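The inflate step itself is short. The sketch below uses Free Pascal's zstream unit and round-trips a small buffer so it can run without a PDF at hand; InflateStream is a hypothetical helper name, and the source stream is assumed to be positioned at the first byte after the stream keyword's line ending. In Delphi, TZDecompressionStream from System.ZLib should fill the same role, though its constructor arguments differ.

```pascal
program InflateDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, zstream;

// Inflates zlib-format data read from Source and returns the raw bytes.
function InflateStream(Source: TStream): AnsiString;
var
  Z: TDecompressionStream;
  Buf: array[0..4095] of Byte;
  Got: LongInt;
  Old: Integer;
begin
  Result := '';
  Z := TDecompressionStream.Create(Source);   // expects the zlib header
  try
    repeat
      Got := Z.Read(Buf, SizeOf(Buf));
      if Got > 0 then
      begin
        Old := Length(Result);
        SetLength(Result, Old + Got);
        Move(Buf[0], Result[Old + 1], Got);
      end;
    until Got < SizeOf(Buf);                   // short read marks the end
  finally
    Z.Free;
  end;
end;

var
  Plain: AnsiString;
  Mem: TMemoryStream;
  Deflater: TCompressionStream;
begin
  // Stand-in for an object stream payload: header pairs, then two bodies.
  Plain := '11 0 12 18 << /Type /Page >> << /Type /Annot >>';
  Mem := TMemoryStream.Create;
  try
    Deflater := TCompressionStream.Create(clDefault, Mem);
    try
      Deflater.WriteBuffer(Plain[1], Length(Plain));
    finally
      Deflater.Free;          // freeing flushes the remaining compressed bytes
    end;
    Mem.Position := 0;
    Writeln(InflateStream(Mem) = Plain);
  finally
    Mem.Free;
  end;
end.
```

The result of InflateStream is exactly the buffer whose leading integer pairs and /First offset describe the packed objects.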

What you do not have to implement

It is easy to overestimate the work. Reading the trailer keys from a compressed file does not require decoding the cross-reference stream's binary entries. The §7.5.8 cross-reference stream uses three entry types, and the type 2 entry, the one that says this object lives inside object stream N at index i, is what you would decode to build a full offset map. You need that map to resolve arbitrary objects by number. You do not need it to read /Root, /Size and /ID, which are in the plaintext dictionary, and you do not need it to expand object streams, because each /ObjStm announces its own contents through /N, /First and the header pairs at the start of its inflated data, and the containers themselves are ordinary indirect objects that a byte scan for obj headers still finds.
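For a sense of what the deferred work involves, decoding one cross-reference row is sketched below in Free Pascal. It assumes the stream has already been inflated and any predictor undone, and that the three field widths come from the dictionary's /W array; TXRefEntry, ReadField and DecodeRow are hypothetical names for this illustration.

```pascal
program XRefRow;
{$mode objfpc}{$H+}

type
  TXRefEntry = record
    Kind: Integer;    // 0 = free, 1 = uncompressed, 2 = inside an object stream
    Field2: Int64;    // type 1: byte offset; type 2: object stream number
    Field3: Int64;    // type 1: generation;  type 2: index within the stream
  end;

// Reads one big-endian field of Width bytes, or Fallback when Width is 0.
function ReadField(const Row: AnsiString; var P: Integer;
  Width: Integer; Fallback: Int64): Int64;
var
  I: Integer;
begin
  if Width = 0 then
    Exit(Fallback);
  Result := 0;
  for I := 1 to Width do
  begin
    Result := (Result shl 8) or Ord(Row[P]);
    Inc(P);
  end;
end;

// Decodes one row given the three widths from /W.
function DecodeRow(const Row: AnsiString; W1, W2, W3: Integer): TXRefEntry;
var
  P: Integer;
begin
  P := 1;
  Result.Kind := Integer(ReadField(Row, P, W1, 1));  // absent type means type 1
  Result.Field2 := ReadField(Row, P, W2, 0);
  Result.Field3 := ReadField(Row, P, W3, 0);
end;

var
  E: TXRefEntry;
begin
  // With /W [1 2 1]: type 2, object stream 7, index 3.
  E := DecodeRow(#2#0#7#3, 1, 2, 1);
  Writeln(E.Kind, ' ', E.Field2, ' ', E.Field3);   // prints: 2 7 3
  // With /W [1 2 1]: type 1, byte offset 256, generation 0.
  E := DecodeRow(#1#1#0#0, 1, 2, 1);
  Writeln(E.Kind, ' ', E.Field2, ' ', E.Field3);   // prints: 1 256 0
end.
```

The row decoder is small; the larger part of the deferred job is everything around it, such as /Index subsections, predictor handling and chains of /Prev sections.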

You also do not have to handle the PNG and TIFF predictor functions that a cross-reference stream may apply through its /DecodeParms just to get the trailer keys. Predictors filter the binary cross-reference rows to make them compress better; they have nothing to do with the dictionary that precedes the stream. The minimal upgrade that makes a classic validator modern-PDF aware is therefore small: when startxref lands on a stream rather than the xref keyword, parse the stream dictionary for the trailer keys, and expand any /ObjStm objects you encounter so their contents enter the object table. Decoding type 2 entries and predictors is a separate, larger task you can defer until you genuinely need random object resolution.

Why a compliance check must expand streams first

This stops being academic the moment you run a profile check. A PDF/A or PDF/X validator inspects specific objects: the document catalog for an /OutputIntents array, the /Metadata stream for an XMP packet with the right identifier, every font descriptor for an embedded font file, the trailer for an /ID. In a compressed file, most of those objects are inside object streams. A validator that has not expanded the object streams cannot see the catalog's keys, cannot find the metadata, and cannot enumerate the fonts. It will report a perfectly conformant document as missing its output intent, missing its XMP, and missing half its structure, because the evidence it needs is still sitting in a Flate blob it never inflated.

The order matters. Expansion has to happen before the checks run, not alongside them, because every check assumes it can reach an object by number. If you wire a profile check directly onto a raw byte scan, it inherits the classic parser's blindness and produces false violations on exactly the modern files that are most likely to be well formed, since they came out of toolchains new enough to write cross-reference streams in the first place.

Letting PDFium do the parsing for you

The PDFium Component parses cross-reference streams and object streams as part of loading a document, which is the practical way to avoid hand-rolling the inflate-and-expand step. When you load a file with the TPdf component, the objects packed into /ObjStm containers are already resolved, and the validation entry points see the fully expanded document. ValidatePdfA returns a TPdfAValidationResult record whose Conformance field is a TPdfAConformance value such as pac1b or pacNone, whose Issues field is a set of the specific problems found, and whose IsCompliant method is true only when a conformance level was detected and the issue set is empty. Because the objects were expanded during load, an /OutputIntents array or an embedded font that lived inside an object stream is found, not reported missing.

uses
  PDFium, FPdfPdfa;

function CheckPdfA(const FileName: string): TPdfAValidationResult;
var
  Pdf: TPdf;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := FileName;
    Pdf.Active := True;            // parses xref/object streams on load
    Result := Pdf.ValidatePdfA;    // sees the expanded object table
  finally
    Pdf.Free;
  end;
end;

The same applies to ValidatePdfX, which returns a TPdfXValidationResult with the same shape. The point of routing through PDFium is that the structural decompression described above happens once, correctly, inside the loader, so your validation code never sees the difference between a classic file and a fully compressed one. Both arrive at the validator as a resolved set of objects.

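var
  Pdf: TPdf;
  R  : TPdfXValidationResult;
  IssueCount: Integer;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'Press_Ready.pdf';
    Pdf.Active := True;            // parses xref/object streams on load
    R := Pdf.ValidatePdfX;
    if R.IsCompliant then
      Writeln('PDF/X conformance: ', Ord(R.Conformance))
    else
    begin
      // Count the members of the issue set; SizeOf would only give the
      // set's storage size. Inline "for var" needs Delphi 10.3 or later.
      IssueCount := 0;
      for var Issue in R.Issues do
        Inc(IssueCount);
      Writeln('Not conformant; issue count = ', IssueCount);
    end;
  finally
    Pdf.Free;
  end;
end;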

If the bytes are already in memory rather than on disk, the same load-then-validate sequence works through the LoadDocument(const Data: TBytes) overload, which takes the raw file content and parses its cross-reference and object streams the same way the file path does. The takeaway for a hand-written validator is the structural rule, not the API: read the trailer keys from the stream dictionary in plaintext, expand every /ObjStm with a Flate decoder before you walk the document, and treat decoding the binary cross-reference entries as the larger, optional job it is.

Once the structure is expanded, a validator can drive the rest of a workflow over it. For a command-line preflight harness that reports conformance across a folder of inputs, see our walkthrough on building a batch preflight report CLI. When validation is a gate ahead of breaking a large document apart, the techniques in our guide to splitting PDF documents into multiple files pair naturally with the load-and-check pattern shown here. Both build on the loading and validation surface of the PDFium Component for Delphi and C++Builder.