Native JBIG2 Bilevel Image Compression in Delphi PDFs

A scanned contract is a few hundred dots per inch of black ink on white paper. Stored as a one-bit-per-pixel bitmap it is already small, yet a hundred such pages still bloat a PDF past anything you would email. The right filter changes the arithmetic. JBIG2 is the highest-ratio compression that ISO 32000-1 defines for bilevel images, and on a stack of scanned text it routinely halves what CCITT Group 4 produces. This is the filter to reach for when the input is faxed, scanned, or otherwise reduced to two colors, and HotPDF can write it directly into a PDF.

The format earns the ratio with two ideas that a generic image codec does not have. It models how black runs sit against a white background, and it notices that a scanned page is mostly the same few hundred glyph shapes repeated thousands of times. Understanding both is what lets you choose the encode options deliberately instead of guessing.
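To put a number on the uncompressed starting point, here is a minimal sketch of the arithmetic. It assumes an A4 page scanned at 300 dpi and rounded to 2480 x 3508 pixels, and the function names RowStride and RawPageBytes are invented for this illustration rather than taken from HotPDF.

program RawBilevelSize;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

// Bytes per row of a 1-bit image: each row is padded to a whole byte.
function RowStride(WidthPx: Integer): Integer;
begin
  Result := (WidthPx + 7) div 8;
end;

// Uncompressed size of one 1-bit page, in bytes.
function RawPageBytes(WidthPx, HeightPx: Integer): Int64;
begin
  Result := Int64(RowStride(WidthPx)) * HeightPx;
end;

const
  // Assumed page geometry: A4 at 300 dpi.
  A4Width = 2480;
  A4Height = 3508;
  PageCount = 100;
begin
  WriteLn('Bytes per row:  ', RowStride(A4Width));
  WriteLn('Bytes per page: ', RawPageBytes(A4Width, A4Height));
  WriteLn('Bytes in ', PageCount, ' pages: ',
    RawPageBytes(A4Width, A4Height) * PageCount);
end.

Under those assumptions a row is 310 bytes, a page is 1,087,480 bytes, and a hundred pages come to roughly 109 million bytes before any filter is applied.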

Where JBIG2 sits in the PDF spec

ISO 32000-1 lists JBIG2Decode among the stream filters in §7.4.7, available from PDF 1.4 onward. It applies to one place only: image XObjects whose /BitsPerComponent is 1 and whose color space resolves to a single channel. That is the whole point. JBIG2 is a bilevel codec, so it never competes with DCTDecode or JPXDecode on photographs. It competes with CCITTFaxDecode, the filter for Group 3 and Group 4 fax coding, on exactly the kind of two-tone page that a document scanner produces.

The decoder consumes the embedded JBIG2 organization that the standard calls the PDF profile, where each image stream holds a sequence of segments rather than a bare bitstream. An optional /JBIG2Globals stream carries segments shared across several images in the same document, which is the mechanism that lets repeated content be stored once for a whole file rather than once per page. HotPDF emits the per-image stream by default and keeps the globals channel free unless a backend asks for it.
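For orientation, this is roughly what the image dictionary looks like in the file. The sketch below only formats the dictionary text with the keys the specification defines for an image XObject using JBIG2Decode. The function name JBIG2ImageDictText, the object number 12, and the stream lengths are placeholders made up for the example, and this is not a description of how HotPDF writes its objects internally.

program JBIG2ImageDict;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Builds the dictionary text for a bilevel image XObject that uses JBIG2Decode.
// GlobalsObjNum is the object number of a shared /JBIG2Globals stream,
// or 0 when the image stream carries all of its segments itself.
function JBIG2ImageDictText(Width, Height, StreamLength,
  GlobalsObjNum: Integer): string;
begin
  Result := Format(
    '<< /Type /XObject /Subtype /Image /Width %d /Height %d' +
    ' /ColorSpace /DeviceGray /BitsPerComponent 1 /Filter /JBIG2Decode',
    [Width, Height]);
  if GlobalsObjNum > 0 then
    Result := Result +
      Format(' /DecodeParms << /JBIG2Globals %d 0 R >>', [GlobalsObjNum]);
  Result := Result + Format(' /Length %d >>', [StreamLength]);
end;

begin
  // Placeholder lengths; a real writer fills in the encoded stream size.
  WriteLn(JBIG2ImageDictText(2480, 3508, 41234, 0));
  WriteLn(JBIG2ImageDictText(2480, 3508, 38912, 12));
end.

The second call shows the only structural difference a globals stream makes at this level: a /DecodeParms entry that points every participating image at the same indirect object.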

The backend-first encoder architecture

A complete JBIG2 encoder is a large piece of software, and the most aggressive parts of it have historically been encumbered by patents and shipped under licenses that do not suit every product. HotPDF resolves that tension by separating the interface from the engine. The unit HPDFJBIG2 defines the calls that the rest of the library makes, and it ships a modest built-in encoder so that JBIG2 works out of the box. When you need production-grade ratios you register a stronger engine and the library delegates to it, with no change to your calling code.

The switch is a single registration call. With no backend registered, the encoder falls back to its built-in path. Register one and every subsequent encode runs through it.

uses
  HPDFJBIG2;

// Query what is active, then optionally install a stronger engine.
if not IsJBIG2EncoderBackendAvailable then
  // Production backend not present: HotPDF uses its built-in MMR path.
  RegisterJBIG2EncoderBackend(MyVendorJBIG2Encode);

// Later, to return to the built-in behaviour:
// ClearJBIG2Backends;

The same hook exists for decoding through RegisterJBIG2DecoderBackend, with IsJBIG2DecoderBackendAvailable to probe it. This is why a library ships a small built-in path plus a backend seam rather than one monolithic encoder. The built-in path keeps the binary lean and free of license entanglements, while the seam lets a team that has licensed a full encoder plug it in without touching the PDF-writing layer at all.
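The seam itself is a small pattern, and it may help to see it stripped of everything JBIG2-specific. The sketch below is a generic illustration with invented names (TSketchEncodeFunc, RegisterBackend, ClearBackend, Encode) and a string standing in for image data. HotPDF's real callback signature is not shown in this article and may well differ.

program BackendSeamSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical callback shape, chosen only for this illustration.
  TSketchEncodeFunc = function(const Input: string): string;

var
  RegisteredBackend: TSketchEncodeFunc = nil;

function BuiltInEncode(const Input: string): string;
begin
  Result := 'builtin(' + Input + ')';
end;

function VendorEncode(const Input: string): string;
begin
  Result := 'vendor(' + Input + ')';
end;

procedure RegisterBackend(Backend: TSketchEncodeFunc);
begin
  RegisteredBackend := Backend;
end;

procedure ClearBackend;
begin
  RegisteredBackend := nil;
end;

// Callers always use this entry point and never learn which engine ran.
function Encode(const Input: string): string;
begin
  if Assigned(RegisteredBackend) then
    Result := RegisteredBackend(Input)
  else
    Result := BuiltInEncode(Input);
end;

begin
  WriteLn(Encode('page'));        // prints builtin(page)
  RegisterBackend(VendorEncode);
  WriteLn(Encode('page'));        // prints vendor(page)
  ClearBackend;
  WriteLn(Encode('page'));        // prints builtin(page)
end.

The calling code is identical in all three cases, which is the property the registration design is after.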

What the encode options actually trade

Encoding is configured through TJBIG2EncodeOptions, a record with the fields Lossless, UseGlobalSegments, UseSymbolDictionary, and LossyLevel. The component-friendly wrapper THPDFJBIG2Options publishes Lossless, UseSymbolDictionary, and LossyLevel so they can be set from the Object Inspector, and it converts to the record internally. Three intents drive the settings.

Lossless reconstruction keeps every pixel. Set Lossless to True and leave LossyLevel at zero, and the decoded bitmap is bit-for-bit identical to the input. This is the only safe choice for line art, technical drawings, and any page where a dropped pixel could change meaning, such as a signature or a stamp.

Symbol-dictionary coding turns on text-aware deduplication and is the option that separates JBIG2 from the fax filters.

The lossy level, an integer from 0 to 9, lets a capable backend trade fidelity for size by treating near-identical marks as the same symbol. Zero means lossless. The built-in encoder honors only the lossless path and ignores any nonzero lossy level, so the higher levels take effect only once a backend that implements them is registered.

var
  Options: TJBIG2EncodeOptions;
begin
  Options := DefaultJBIG2EncodeOptions;   // Lossless True, symbol dictionary on
  Options.Lossless := True;
  Options.LossyLevel := 0;                // 0 keeps every pixel
  Options.UseSymbolDictionary := True;    // dedupe repeated glyphs
  // Pass Options to a backend, or let THPDFJBIG2Options carry them.
end;
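The interaction between Lossless, LossyLevel, and the presence of a backend can be written down as a small decision function. This is a sketch of the rules as described above, not HotPDF code: TSketchOptions is a local stand-in for the real record, and treating Lossless = True as overriding a nonzero LossyLevel is an assumption made here for the example.

program EffectiveOptionsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Local stand-in that mirrors some of the field names described in the text.
  TSketchOptions = record
    Lossless: Boolean;
    UseSymbolDictionary: Boolean;
    LossyLevel: Integer;
  end;

function MakeOptions(Lossless: Boolean; LossyLevel: Integer): TSketchOptions;
begin
  Result.Lossless := Lossless;
  Result.UseSymbolDictionary := True;
  Result.LossyLevel := LossyLevel;
end;

// Returns the lossy level that would actually take effect.
function EffectiveLossyLevel(const Options: TSketchOptions;
  BackendRegistered: Boolean): Integer;
begin
  if Options.Lossless or not BackendRegistered then
    Result := 0                      // lossless request, or built-in path
  else if Options.LossyLevel < 0 then
    Result := 0                      // clamp to the documented 0..9 range
  else if Options.LossyLevel > 9 then
    Result := 9
  else
    Result := Options.LossyLevel;
end;

begin
  WriteLn(EffectiveLossyLevel(MakeOptions(True, 5), True));    // prints 0
  WriteLn(EffectiveLossyLevel(MakeOptions(False, 5), False));  // prints 0
  WriteLn(EffectiveLossyLevel(MakeOptions(False, 5), True));   // prints 5
  WriteLn(EffectiveLossyLevel(MakeOptions(False, 12), True));  // prints 9
end.

The second case is the one that tends to surprise people: asking for a lossy level without a capable backend quietly yields lossless output.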

Symbol dictionaries and why text scans win

A page of scanned text is not really an image of words. It is the same letter e printed several hundred times, the same t, the same comma, each instance a slightly noisy copy of one underlying shape. A symbol dictionary captures that structure. The encoder collects the distinct marks on the page into a dictionary, stores each shape once, and then records the page as a list of positions that reference dictionary entries. A thousand occurrences of the same glyph cost one stored bitmap plus a thousand cheap placements.
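The bookkeeping behind that idea fits in a few lines. The sketch below is a toy model under loud assumptions: glyph bitmaps are flattened to short strings, only exact matches are merged (the lossless case), and the names SymbolIndexOf and AddMark are invented for the illustration. A real encoder works on segmented bitmaps and entropy-codes both the dictionary and the placements.

program SymbolDictionarySketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  TPlacement = record
    SymbolIndex, X, Y: Integer;   // which shape, and where it sits on the page
  end;

var
  Dictionary: array of string;    // each distinct shape, stored once
  Placements: array of TPlacement;

// Returns the dictionary index for Shape, adding it if it has not been seen.
function SymbolIndexOf(const Shape: string): Integer;
var
  I: Integer;
begin
  for I := 0 to High(Dictionary) do
    if Dictionary[I] = Shape then
    begin
      Result := I;
      Exit;
    end;
  SetLength(Dictionary, Length(Dictionary) + 1);
  Result := High(Dictionary);
  Dictionary[Result] := Shape;
end;

procedure AddMark(const Shape: string; X, Y: Integer);
var
  N: Integer;
begin
  N := Length(Placements);
  SetLength(Placements, N + 1);
  Placements[N].SymbolIndex := SymbolIndexOf(Shape);
  Placements[N].X := X;
  Placements[N].Y := Y;
end;

const
  // Toy shapes, '#' for ink and '.' for paper; not real scan data.
  GlyphE = '.##.#..######...###';
  GlyphT = '.#.###.#..#..##';
begin
  AddMark(GlyphE, 10, 20);
  AddMark(GlyphT, 18, 20);
  AddMark(GlyphE, 25, 20);
  AddMark(GlyphE, 40, 20);
  WriteLn('Marks placed:  ', Length(Placements));   // 4
  WriteLn('Shapes stored: ', Length(Dictionary));   // 2
end.

Four marks on the page cost two stored shapes plus four small placement records, and the ratio only improves as the page fills with text.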

This is precisely where JBIG2 pulls ahead of CCITT Group 4. Group 4 codes each scan line against the line above it with no notion of a glyph, so it pays the full cost of every letter every time the letter appears. JBIG2 pays once. When the same dictionary is promoted to the document-level globals stream, the saving compounds across a multi-page scan, because the shapes shared by page after page are stored a single time for the whole file. On dense text the difference is not marginal. It is the reason JBIG2 exists.

Generic region and MMR for everything else

Not every bilevel image is text. Maps, schematics, engineering drawings, and mixed pages have line art that no dictionary can summarize. For those, JBIG2 codes a generic region, a rectangle of pixels compressed directly without any symbol training. The standard allows a generic region to use MMR, the Modified Modified READ coding that Group 4 fax already uses, which models each row of pixels against the row above it.
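To see why coding a row against the row above it pays off, it helps to look at where the color changes fall. The sketch below is a simplified proxy and not an MMR encoder: it lists the changing elements of two rows and counts how many transitions in the current row sit within three pixels of a same-direction transition in the row above, which is the situation Group 4 coding can express with a short vertical-mode codeword. Rows are written as strings of '0' and '1' for readability, and all names are invented for the example.

program RowAgainstRowSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  TIntArray = array of Integer;

// Positions where the color changes, scanning left to right.
// An imaginary white ('0') pixel precedes the row, as fax coding assumes.
function ChangingElements(const Row: string): TIntArray;
var
  I, N: Integer;
  Prev: Char;
begin
  Result := nil;
  Prev := '0';
  N := 0;
  for I := 1 to Length(Row) do
    if Row[I] <> Prev then
    begin
      SetLength(Result, N + 1);
      Result[N] := I;
      Inc(N);
      Prev := Row[I];
    end;
end;

// Counts transitions in Current that lie within MaxShift pixels of a
// transition in the same direction in Reference.
function NearbyTransitions(const Reference, Current: string;
  MaxShift: Integer): Integer;
var
  RefCh, CurCh: TIntArray;
  I, J: Integer;
begin
  Result := 0;
  RefCh := ChangingElements(Reference);
  CurCh := ChangingElements(Current);
  for I := 0 to High(CurCh) do
    for J := 0 to High(RefCh) do
      // Even indexes are white-to-black changes, odd ones black-to-white.
      if ((I mod 2) = (J mod 2)) and
         (Abs(CurCh[I] - RefCh[J]) <= MaxShift) then
      begin
        Inc(Result);
        Break;
      end;
end;

begin
  // The second row is the first one shifted right by a single pixel.
  WriteLn(NearbyTransitions('0011110000', '0001111000', 3),
    ' of 2 transitions track the row above');
end.

Line art is full of rows like these, where an edge drifts by a pixel or two from one scan line to the next, and that is why a row-against-row coder handles it well even without any notion of a glyph.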

This is the path HotPDF ships in its built-in encoder. When no backend is registered and the request is lossless, the library compresses the bitmap as a single MMR generic region and wraps it in the JBIG2 segment structure that the PDF profile requires. It needs no dictionary, no training pass, and no second image to reference, so it is the dependable default for line art and mixed bilevel content. It will not match a full symbol-dictionary encoder on pure text, but it is always correct, always lossless, and always present. The encoder surface for it is one call.

var
  Encoder: THPDFJBIG2Encoder;
  ImageData: TJBIG2ByteArray;
  Scanlines: TJBIG2ScanlineArray;  // one byte array per row, MSB-first
  W, H: Integer;
begin
  // Scanlines, W and H describe a 1-bit page; each row is (W + 7) div 8 bytes.
  Encoder := THPDFJBIG2Encoder.Create;
  try
    if Encoder.EncodeToByteArray(Scanlines, W, H, ImageData) then
      // ImageData now holds a JBIG2 stream ready for a /JBIG2Decode XObject.
      ;
  finally
    Encoder.Free;
  end;
end;
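The row layout that the comments above describe, (W + 7) div 8 bytes per row with the leftmost pixel in the most significant bit, is easy to get wrong by one shift. Here is a minimal packing sketch. PackRow and TRowBytes are names invented for the example rather than HotPDF types, and whether a set bit means ink or paper is left to the caller, since the article does not state the polarity the encoder expects.

program PackScanlineSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

type
  TRowBytes = array of Byte;

// Packs one row of pixels into bytes, leftmost pixel in the top bit.
function PackRow(const Pixels: array of Boolean): TRowBytes;
var
  X: Integer;
begin
  Result := nil;
  SetLength(Result, (Length(Pixels) + 7) div 8);  // new bytes start at zero
  for X := 0 to High(Pixels) do
    if Pixels[X] then
      Result[X div 8] := Result[X div 8] or ($80 shr (X mod 8));
end;

var
  Row: TRowBytes;
begin
  // Ten pixels need two bytes; the last six bits of the second byte are padding.
  Row := PackRow([True, False, True, False, False,
                  False, False, True, True, True]);
  WriteLn(Length(Row), ' bytes: ',
    IntToHex(Row[0], 2), ' ', IntToHex(Row[1], 2));  // 2 bytes: A1 C0
end.

The padding bits at the end of each row stay zero, so rows of any width start on a byte boundary.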

Turning it on when you build a document

For everyday use you do not touch the encoder class directly. HotPDF exposes JBIG2 as an image compression choice on the document. The enumeration THPDFImageCompressionType includes icJBIG2 alongside the Flate, JPEG, and CCITT options, and the document carries a JBIG2Options property of type THPDFJBIG2Options that holds the settings used when that compression is selected. Configure both before you add the bilevel images you want compressed this way.

var
  Pdf: THotPDF;
begin
  Pdf := THotPDF.Create(nil);
  try
    Pdf.ImageCompressionType := icJBIG2;     // route 1-bit images through JBIG2
    Pdf.JBIG2Options.Lossless := True;        // keep every pixel
    Pdf.JBIG2Options.UseSymbolDictionary := True;
    Pdf.JBIG2Options.LossyLevel := 0;
    // Add pages and place your scanned 1-bit images here.
  finally
    Pdf.Free;
  end;
end;

One convenience worth noting is the DBGridHotPDFExport add-on, which renders a TDBGrid straight to a PDF. Its output is largely bilevel rules and text, so a document configured for JBIG2 keeps those exports compact without any extra handling on your part. Two related topics on this blog go deeper into the surrounding workflow. For how images and fonts are laid down when you build reports, see report output with fonts and images in Delphi. When a compressed document must satisfy an archival profile, the rules in PDF/A, PDF/X, and PDF/UA validation in Delphi tell you which filters a given conformance level accepts. JBIG2 ships as part of the HotPDF Component for Delphi and C++Builder, next to the loading, editing, and encryption APIs covered elsewhere here.