Hardening a Pascal PDF Parser Against Malicious Files

A PDF is not a document you open. It is a small program you run. Every embedded font is a stack-based interpreter waiting for charstrings, every image is a decoder fed width, height, and bit-depth fields the file chose, and every stream arrives wrapped in filters whose parameters the file set. None of those numbers are yours. They came from whoever produced the file, which on a real workload is a customer's invoice or an attachment from an unknown sender. The decoders that turn those bytes into pixels and glyphs are the attack surface, and a parser that trusts its input there is one malformed file away from a crash or worse.

PDFlibPas went through a hardening pass that treated the whole decode path as hostile, across the font programs (TrueType, Type1, CFF, and the CMap tables), the image decoders (PNG, GIF, TIFF, JBIG2, and CCITT Group 3 and Group 4), and the stream filters (LZW, ASCII85, and the Flate predictors). What follows are five defect classes it closed, each grounded in the specific Delphi behavior that made it possible. They are fixed in current releases, and the same shapes recur in any Pascal code that parses untrusted input.

An integer overflow that hands you an undersized buffer

The classic memory-safety bug in an image decoder is a dimension product that wraps. A decoder reads width, height, component count, and bit depth, multiplies them to size its output, allocates that many bytes, then writes the image at its true dimensions. If the multiplication is done in 32-bit arithmetic, the product can wrap to a small value even when every individual factor is within a sane range, so the allocation succeeds, comes out far too small, and the decode walks off the end of it. This is CWE-190, integer overflow, leading to a heap out-of-bounds write (CWE-787) one step later.

The shared image path already clamped each dimension to 65535; the standalone decoders did not all inherit that clamp. A row-bytes-times-height expression such as ByteCount * FHeight, or a per-pixel expression such as FWidth * Components * BitDepth, is a 32-bit product in Delphi when every operand is a 32-bit integer, regardless of how wide the variable you assign the result to is. A width and a height of 60000 are each plausible for a large scan, but at one byte per pixel their product is 3,600,000,000, which exceeds the signed 32-bit maximum of 2,147,483,647, so the computed length wraps to a negative or misleadingly small value. The same trap lived in the Flate predictor stride, BitsPerComponent * Colors * Columns.
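The wrap is easy to reproduce in isolation. The sketch below is not PDFlibPas code; the variable names are made up for illustration, and it assumes Delphi with overflow checking off, which is the default. Under those rules the first product should come out as 3,600,000,000 minus 2^32, which is -694,967,296, while the second keeps the true value. Other Pascal compilers may widen the operands differently, so treat the first number as Delphi's behaviour rather than a universal one.

```pascal
program WrapDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$Q-}{$R-}  // overflow and range checking off, as in a default release build

var
  W, H: Integer;        // 32-bit fields, as a decoder would read them
  Narrow, Wide: Int64;  // both destinations are 64-bit
begin
  W := 60000;
  H := 60000;
  // Both operands are 32-bit, so Delphi multiplies at 32 bits and only
  // then widens the (already wrapped) result into the Int64 destination.
  Narrow := W * H;
  // One operand is Int64, so the whole product is evaluated at 64 bits.
  Wide := Int64(W) * H;
  Writeln('32-bit product: ', Narrow);
  Writeln('64-bit product: ', Wide);
end.
```

The destination type is identical in both assignments; only the operand width differs, and that alone decides whether the length is trustworthy.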

The fix is to make at least one operand Int64 so the whole expression is evaluated in 64-bit, then compare against MaxInt and refuse the file before narrowing back down to call SetLength.

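// Reject before allocating, not after writing.
// RowBytes is declared Int64; casting the first operand to Int64 makes the
// whole product evaluate in 64-bit, so it cannot wrap at 32 bits.
RowBytes := (Int64(FWidth) * Components * BitDepth + 7) div 8;
// Dividing MaxInt by FHeight keeps the size test itself from overflowing.
if (RowBytes <= 0) or (FHeight <= 0) or (RowBytes > MaxInt div FHeight) then
  Exit;  // hostile or unsupportable dimensions; refuse the image
SetLength(Buffer, RowBytes * FHeight);  // proven above to fit in 32 bits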

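What makes this a Delphi problem rather than a generic one is how silent it is. The width of the multiplication is set by its operands, not by its destination, so a product of 32-bit integers has already wrapped by the time it is assigned anywhere, and narrowing a too-wide expression into a 32-bit destination is a legal conversion the compiler will not warn about by default. Range checking does not catch the wrap either, because it guards indexes and assignments rather than arithmetic; overflow checking ({$Q+}) would trap it, but that is off by default. Leave the product at 32 bits and the language quietly gives you a length that lies about how much memory the decode is about to touch.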
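The predictor stride mentioned earlier takes the same treatment. What follows is a minimal sketch rather than the library's code: TryPredictorStride is a hypothetical helper, and the upper limits on BitsPerComponent and Colors are assumptions chosen only so the three-way product stays well inside Int64.

```pascal
program StrideCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: computes the predictor row stride in bytes and
// returns False, instead of a wrapped value, when the fields are hostile.
function TryPredictorStride(BitsPerComponent, Colors, Columns: Integer;
  out Stride: Integer): Boolean;
var
  Bytes: Int64;
begin
  Result := False;
  Stride := 0;
  // Reject non-positive fields first, so every factor below is at least 1.
  if (BitsPerComponent <= 0) or (Colors <= 0) or (Columns <= 0) then
    Exit;
  // Assumed limits for this sketch; they bound the product below 2^44.
  if (BitsPerComponent > 32) or (Colors > 256) then
    Exit;
  // The Int64 cast forces the whole product to 64-bit arithmetic.
  Bytes := (Int64(BitsPerComponent) * Colors * Columns + 7) div 8;
  if Bytes > MaxInt then
    Exit;
  Stride := Integer(Bytes);  // safe: just proven to fit in 32 bits
  Result := True;
end;

var
  S: Integer;
begin
  if TryPredictorStride(8, 3, 1000, S) then
    Writeln('stride = ', S);
  if not TryPredictorStride(16, 4, MaxInt, S) then
    Writeln('oversized Columns rejected');
end.
```

For 8 bits, 3 colors, and 1000 columns the stride is 3000 bytes. The second call is refused because its true stride is far beyond MaxInt, which is exactly the case a 32-bit product would have hidden.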

A field type that makes a guard impossible to fire

A TIFF file is a chain of image file directories, each carrying the byte offset of the next. A malicious file can point that chain back at itself, and a reader that walks it without a stop condition runs forever. That is CWE-835, an infinite loop driven by attacker-controlled input, and the defense is a counter that stops once it passes a limit no legitimate file would reach.

The page counter was declared as Word, which in Delphi holds 0 to 65535. The loop carried a termination guard of the form "stop when the page count exceeds 65535," which reads as correct until you notice the operand and the threshold share an upper bound. A Word can never be greater than 65535, so the comparison is structurally always false: when the counter reaches 65535 the next increment wraps it back to 0, the guard never sees a value above the ceiling, and a looping IFD chain keeps the reader spinning.

The fix was to widen the field so the guard can express a value the counter can actually hold. With TPDFTIFF.FPageCount declared as Integer, the same FPageCount > 65535 comparison becomes reachable, the loop terminates, and the public PageCount property changed type to match without breaking any caller. Whenever a bound check has the shape Value > MaxValueOfType(Value), where the threshold is the largest value the operand's type can hold, the condition is a constant false: widen the type, or test for equality with the maximum so the guard can fire.
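A reachable guard can be sketched without any real TIFF parsing. In the example below the directory chain is modelled as an array of "next" links, which is an assumption made purely for illustration; CountPages and MaxPages are hypothetical names, not the library's.

```pascal
program IfdWalk;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MaxPages = 65535;  // ceiling taken from the text above

// Hypothetical stand-in for the reader: NextIfd[i] is the index of the
// directory that follows directory i; any negative value ends the chain.
function CountPages(const NextIfd: array of Integer;
  out Count: Integer): Boolean;
var
  Current: Integer;
begin
  Result := False;
  Count := 0;  // Integer, so it can actually hold MaxPages + 1
  if Length(NextIfd) = 0 then
    Current := -1  // no directories at all
  else
    Current := 0;
  while Current >= 0 do
  begin
    if Current > High(NextIfd) then
      Exit;  // link points outside the file
    Inc(Count);
    if Count > MaxPages then
      Exit;  // reachable only because Count is wider than Word
    Current := NextIfd[Current];
  end;
  Result := True;
end;

var
  N: Integer;
begin
  if CountPages([1, 2, -1], N) then
    Writeln('pages: ', N);
  if not CountPages([1, 0], N) then
    Writeln('cyclic chain rejected');
end.
```

The second chain points back at its own start, and the walk gives up once the counter passes the ceiling. Declare Count as Word instead and, with overflow checking off, that comparison can never be true again.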

Range checking turned off on a hot path

With range checking on, Delphi inserts a bounds check on every array and string index, which is the difference between an out-of-range index raising a catchable ERangeError and that same index reading or writing memory that does not belong to the structure. Hot paths sometimes disable it with a local {$R-} directive, which is defensible right up until the indices stop being trustworthy.

The list accessor the font interpreters lean on, TPDFlibStringList.Get, is exactly such a path. On Windows it is compiled with range checking off and indexes its backing store directly, so an out-of-range index is not an error but a raw memory access. That is fine when the index is always valid, and it stops being fine inside a CFF or Type2 charstring interpreter, where the index can come from the file. A charstring that pops an operand off an empty stack produces an index of negative one; a glyph identifier off by one against the glyph count indexes one slot past the end. With range checking off, both become a genuine out-of-bounds access instead of a catchable exception, and because the slots hold reference-counted AnsiString values, a stray read can also corrupt a string's reference count.

The hardening did not switch range checking back on for the hot path. It made the indices provably valid first: before taking the top of the operand stack the interpreter checks the stack is non-empty, and every index guard was written as a strict less-than against the count rather than a less-than-or-equal that admits the off-by-one. The directive moves responsibility for bounds from the compiler to you, and the validation it removed has to be put back by hand at every entry point.
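Both guards are small enough to show in full. This is a sketch under assumptions: TOperandStack, TryPop, and IsValidIndex are hypothetical names, and the stack capacity of 16 is arbitrary. Range checking is switched off on purpose, so the only thing standing between the index and raw memory is the hand-written check.

```pascal
program GuardedIndex;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$R-}  // range checking off, as on the hot path described above

type
  // Hypothetical operand stack for a charstring interpreter.
  TOperandStack = record
    Items: array[0..15] of Integer;
    Count: Integer;
  end;

// Pops the top operand, or returns False if the stack is empty.
function TryPop(var Stack: TOperandStack; out Value: Integer): Boolean;
begin
  Value := 0;
  Result := Stack.Count > 0;  // an empty stack would yield index -1
  if Result then
  begin
    Dec(Stack.Count);
    Value := Stack.Items[Stack.Count];
  end;
end;

// True only for indices that name an existing slot.
function IsValidIndex(Index, Count: Integer): Boolean;
begin
  // Strict less-than on the upper side: <= would admit one past the end.
  Result := (Index >= 0) and (Index < Count);
end;

var
  Stack: TOperandStack;
  V: Integer;
begin
  Stack.Count := 0;
  if not TryPop(Stack, V) then
    Writeln('empty stack: pop refused');
  Stack.Items[0] := 7;
  Stack.Count := 1;
  if TryPop(Stack, V) then
    Writeln('popped ', V);
  if not IsValidIndex(5, 5) then
    Writeln('index 5 of 5 refused');  // one past the end
end.
```

Returning a Boolean rather than raising keeps the hot path free of exception frames, at the cost of every caller having to test the result.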

Unbounded recursion in a charstring interpreter

A Type2 charstring can call a subroutine, and a subroutine is itself a charstring that can call another, so the local and global subroutine call operators let the file decide how deep it goes. A subroutine that calls itself, directly or through a cycle, recurses without end until the native stack is exhausted and the process dies. That is CWE-674, uncontrolled recursion.

The Type1 interpreter already guarded against this. It carried a call-depth counter and a ceiling, PLType1MaxCallDepth, and refused to descend past it, which reflects the depth limit the Type1 specification itself names. The Type2 interpreter, added later and structurally similar, did not carry the same guard, and a hand-built font with a subroutine that calls its own number walks straight through the missing check into a stack overflow.

// The shape of the Type1 guard the Type2 path was missing.
// Track depth across nested calls and refuse to recurse past it.
// Test before incrementing so the refusal path leaves the counter balanced.
if CallDepth >= PLType1MaxCallDepth then
  Exit;  // hostile self-referential subroutine; stop descending
Inc(CallDepth);
try
  // ... interpret the subroutine
finally
  Dec(CallDepth);  // restore the depth on every exit path
end;

The fix was to give the Type2 path the same bounded depth its Type1 sibling already had. Any recursive descent over attacker-controlled structure, whether font subroutines, a nested array, or a cross-reference chain, needs a depth ceiling the input cannot lift.
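The effect of the ceiling is easiest to see against a subroutine that calls itself. The model below is deliberately tiny and entirely hypothetical: each subroutine either calls one other subroutine or returns, RunSubr is an invented name, and the ceiling of 10 is an assumed value for the sketch rather than a claim about PLType1MaxCallDepth.

```pascal
program SubrDepth;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MaxCallDepth = 10;  // assumed ceiling for this sketch

// Hypothetical model: Calls[i] is the subroutine that subroutine i calls,
// or a negative value if it returns without calling anything.
function RunSubr(const Calls: array of Integer;
  Index, Depth: Integer): Boolean;
begin
  Result := False;
  if (Index < 0) or (Index > High(Calls)) then
    Exit;  // the subroutine number comes from the file; validate it
  if Depth >= MaxCallDepth then
    Exit;  // a ceiling the input cannot lift
  if Calls[Index] < 0 then
    Result := True  // leaf subroutine: nothing further to call
  else
    Result := RunSubr(Calls, Calls[Index], Depth + 1);
end;

begin
  if RunSubr([1, 2, -1], 0, 0) then
    Writeln('nested chain interpreted');
  if not RunSubr([0], 0, 0) then
    Writeln('self-calling subroutine refused');
end.
```

Passing the depth as a parameter, instead of keeping it in a field, means there is no counter to restore on the way out, so an early exit cannot leave it unbalanced.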

Uninitialized memory that leaks into the output

The most subtle defect leaked heap contents into decrypted output, and the cause is a property of SetLength that is easy to forget. When you grow an AnsiString with SetLength, Delphi allocates the bytes but does not zero them, so the new region holds whatever was previously in that heap memory. If every byte is subsequently written, this never matters; if a path leaves part of the buffer unwritten and then returns it as data, those stale bytes ride out with the result. That is a use of uninitialized memory, tracked as CWE-457 (Use of Uninitialized Variable), and when the result crosses a trust boundary it becomes an information leak.

The AES-CBC decryption path hit exactly this. The output buffer was sized with SetLength and the decryptor processed the ciphertext one 16-byte block at a time. When the ciphertext length was not a multiple of 16, a length an attacker can choose, the trailing partial block was never written, so those final bytes kept the heap contents SetLength left behind and the buffer was handed back as the decrypted plaintext of a document object. The remedy is two guards, and neither alone is enough: the decryption entry point now rejects any ciphertext whose length is not a multiple of the block size, and as a backstop the output is cleared with FillChar before use so any path that fails to write a region returns zeros rather than heap residue.
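The two guards can be sketched without any cipher at all, since neither depends on the decryption itself. PrepareDecryptBuffer is a hypothetical name for illustration, not a PDFlibPas routine, and this sketch also refuses an empty input so that indexing the first byte of the buffer is always valid.

```pascal
program BlockGuard;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  AesBlockSize = 16;

// Hypothetical entry check: validates the ciphertext length and hands back
// a zeroed output buffer. The actual decryption is out of scope here.
function PrepareDecryptBuffer(const Cipher: AnsiString;
  out Plain: AnsiString): Boolean;
begin
  Plain := '';
  // Guard 1: only whole blocks are accepted; empty input is refused too.
  Result := (Length(Cipher) > 0) and (Length(Cipher) mod AesBlockSize = 0);
  if not Result then
    Exit;
  SetLength(Plain, Length(Cipher));      // new bytes are not zeroed here
  // Guard 2: clear the buffer so an unwritten region reads as zeros.
  FillChar(Plain[1], Length(Plain), 0);
end;

var
  P: AnsiString;
begin
  if not PrepareDecryptBuffer(StringOfChar(AnsiChar('x'), 17), P) then
    Writeln('17-byte ciphertext rejected');
  if PrepareDecryptBuffer(StringOfChar(AnsiChar('x'), 32), P) then
    Writeln('32-byte ciphertext accepted, buffer length ', Length(P));
end.
```

A decryptor that then writes every block overwrites the zeros and nothing is lost; one that skips a region returns zeros in its place instead of whatever the heap held.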

What the pass leaves you with

The five defects are different bugs, but they rhyme: an integer width that wraps a product, a field type that pins a guard to a constant false, a range check disabled where the indices stopped being safe, a recursion with no ceiling, and a buffer the language declined to zero. In each one Delphi did exactly what it defines, because the language gives you arithmetic that wraps, narrowing that is silent, range checks you can switch off, recursion with no built-in limit, and allocation that does not initialize. That is the contract, and a Pascal parser meets it by owning four things by hand at every boundary the file controls: integer width, range checking, recursion depth, and buffer initialization.

These defects are closed in current PDFlibPas releases, the engine for Delphi and C++Builder. If your work also reaches into how a file claims to be protected, the companion notes on auditing encryption and permissions and on PDF/A and PDF/UA preflight cover the analysis side of the same parser, and all of it ships inside the PDFlibPas Delphi PDF Library alongside the loading, rendering, and signing APIs covered elsewhere on this blog.