Technical Article

Extracting Images from PDFs with PDFium VCL in Delphi

PDF stores images as first-class objects inside its content streams. When a page references a photograph, a scan, or a diagram, the pixel data lives in an XObject dictionary alongside the page geometry. PDFium VCL surfaces that through two properties on TPdf: BitmapCount, which returns how many embedded bitmaps are on the current page, and Bitmap[Index], which decodes one of them into a TBitmap you own and must free. That is the whole extraction model. The loop is four lines; what takes judgment is the surrounding plumbing.

Opening the document

The first thing to know about TPdf is that Active := True never raises. Load failures, wrong passwords, corrupted files: all of them are swallowed internally and the component simply stays inactive. You have to check the flag yourself after the assignment, or you will proceed into the page loop with PageCount returning zero and wonder why nothing was extracted.

var
  Pdf: TPdf;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'report.pdf';
    Pdf.Active := True;
    if not Pdf.Active then
    begin
      Writeln('Failed to open: ', Pdf.FileName);
      Exit;
    end;
    Writeln(Pdf.PageCount, ' pages');
    // proceed to extraction
  finally
    Pdf.Free;
  end;
end;

Password-protected files follow the same pattern: assign Pdf.Password before setting Active := True. If the password is wrong, Active stays False and you get no exception to catch. In a batch tool processing hundreds of files, that silent behavior is actually useful: you accumulate the failures in a list rather than unwinding the call stack for each one.
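
The batch pattern described above can be sketched as follows. This is a minimal illustration, not the component's prescribed usage: TryOpenPdf and OpenBatch are hypothetical helper names, Password is assumed to be a string property, and setting Active := False is assumed to be a safe way to close a previously opened file before reusing the same TPdf instance.

```pascal
// Assumes System.SysUtils, System.Classes and the PDFium VCL unit that
// declares TPdf are in the uses clause.

// Hypothetical helper: opens one file and reports success as a Boolean.
function TryOpenPdf(Pdf: TPdf; const FileName, Password: string): Boolean;
begin
  Pdf.Active := False;       // assumption: closes any previously opened file
  Pdf.FileName := FileName;
  Pdf.Password := Password;  // pass '' for unprotected files
  Pdf.Active := True;        // never raises; a failure leaves Active = False
  Result := Pdf.Active;
end;

// Hypothetical driver: collects the names of files that failed to open.
procedure OpenBatch(const FileNames: array of string; Failures: TStrings);
var
  Pdf: TPdf;
  I: Integer;
begin
  Pdf := TPdf.Create(nil);
  try
    for I := Low(FileNames) to High(FileNames) do
    begin
      if not TryOpenPdf(Pdf, FileNames[I], '') then
      begin
        Failures.Add(FileNames[I]);  // record the failure and move on
        Continue;
      end;
      // ... run the extraction for this document here ...
    end;
  finally
    Pdf.Free;
  end;
end;
```

After the loop, Failures holds every file that could not be opened, whether the cause was a bad password, a corrupted file, or a missing path; the component does not tell these cases apart through Active alone.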

Iterating pages and pulling bitmaps

BitmapCount is per-page, so you set Pdf.PageNumber before reading it. Page numbers are 1-based; the default is 0, meaning no page is loaded. The Bitmap[Index] property is 0-based and returns a caller-owned TBitmap. You must free it. Neglect the free inside a long loop over a large document and memory climbs fast, because each bitmap can be several megabytes of raw pixel data before any compression.

procedure ExtractAllImages(Pdf: TPdf; const OutputDir: string);
var
  Page, Idx: Integer;
  Bmp: TBitmap;
  OutPath: string;
begin
  for Page := 1 to Pdf.PageCount do
  begin
    Pdf.PageNumber := Page;
    for Idx := 0 to Pdf.BitmapCount - 1 do
    begin
      Bmp := Pdf.Bitmap[Idx];
      if not Assigned(Bmp) then
        Continue;
      try
        OutPath := Format('%s\p%d_img%d.bmp', [OutputDir, Page, Idx + 1]);
        Bmp.SaveToFile(OutPath);
      finally
        Bmp.Free;
      end;
    end;
  end;
end;

The Assigned guard matters. A small number of PDF generators write image XObjects with zero pixel dimensions or otherwise malformed data; in those cases the component returns nil rather than an empty bitmap. Treating a nil return as an error and stopping extraction is the wrong reflex: skip it, log the page and index if you need the audit trail, and continue. The rest of the page may still yield valid images.
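
If the audit trail is needed, the skip-and-log behavior might look like the sketch below. ExtractPageImages and the SkipLog parameter are hypothetical names introduced for illustration; only PageNumber, BitmapCount, and Bitmap[] come from the component as described in this article.

```pascal
// Assumes System.SysUtils, System.Classes, Vcl.Graphics and the unit
// declaring TPdf are in the uses clause.

// Hypothetical single-page variant of the extraction loop that records
// every nil return instead of silently dropping it.
procedure ExtractPageImages(Pdf: TPdf; Page: Integer;
  const OutputDir: string; SkipLog: TStrings);
var
  Idx: Integer;
  Bmp: TBitmap;
begin
  Pdf.PageNumber := Page;
  for Idx := 0 to Pdf.BitmapCount - 1 do
  begin
    Bmp := Pdf.Bitmap[Idx];
    if not Assigned(Bmp) then
    begin
      // Malformed or empty image XObject: note it and keep going.
      SkipLog.Add(Format('page %d, index %d: no bitmap returned', [Page, Idx]));
      Continue;
    end;
    try
      Bmp.SaveToFile(IncludeTrailingPathDelimiter(OutputDir) +
        Format('p%d_img%d.bmp', [Page, Idx + 1]));
    finally
      Bmp.Free;
    end;
  end;
end;
```

Passing the log in as a TStrings keeps the routine usable from both a console tool (a TStringList written to disk at the end) and a form (the Lines of a memo).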

Notice that the outer loop sets Pdf.PageNumber on every iteration. That assignment is what loads the page into the component's internal state and makes BitmapCount meaningful. Skip it and you read the same page's count repeatedly. The pattern feels redundant when you write it, but it is how the API is designed: the page is a cursor, not a collection.

Choosing an output format

BMP is lossless and always available without additional units, which makes it a sound default when you do not yet know what the image contains. When file size matters, the pixel format of the returned TBitmap tells you which codec is appropriate. A 32-bit bitmap carries an alpha channel; PNG preserves that without loss. A large 24-bit image with continuous tone is a candidate for JPEG. Smaller images or ones drawn with a limited palette are generally better left as BMP than run through JPEG, which adds blocking artifacts at low quality settings and saves little at high ones.

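// Requires System.SysUtils and Vcl.Imaging.jpeg (TJPEGImage) in the uses clause.
procedure SaveBitmap(Bmp: TBitmap; const FileName: string);
var
  Ext: string;
  Jpg: TJPEGImage;
begin
  // Compare the extension case-insensitively; case..of cannot branch on strings.
  Ext := ExtractFileExt(FileName);
  if SameText(Ext, '.jpg') or SameText(Ext, '.jpeg') then
  begin
    Jpg := TJPEGImage.Create;
    try
      Jpg.Assign(Bmp);
      Jpg.CompressionQuality := 85;
      Jpg.SaveToFile(FileName);
    finally
      Jpg.Free;
    end;
  end
  else
    Bmp.SaveToFile(FileName);  // BMP: lossless, no extra units
end;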

In practice, format selection is driven by Bmp.PixelFormat and dimensions. If PixelFormat = pf32bit you need a format that carries alpha; PNG is the obvious choice, though it requires the PNGImage unit in older Delphi versions. For 24-bit images wider than roughly 300 pixels, JPEG at quality 85 typically gives a size reduction of three to one or better over BMP with no perceptible loss in most photographic content. Below that threshold BMP is comparable in size and avoids any quality decision entirely.
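
The selection rule in the paragraph above can be written down as a small pure function. This is a sketch under the stated rule of thumb: ChooseImageExt and JpegMinWidth are hypothetical names, and the 300-pixel threshold is the approximate figure from the text, not a library constant.

```pascal
// Assumes Vcl.Graphics (TPixelFormat) is in the uses clause.

const
  JpegMinWidth = 300;  // rough threshold from the text; tune for your data

// Hypothetical helper: maps pixel format and width to a file extension.
function ChooseImageExt(PixelFormat: TPixelFormat; Width: Integer): string;
begin
  if PixelFormat = pf32bit then
    Result := '.png'   // alpha channel present: needs a format that keeps it
  else if (PixelFormat = pf24bit) and (Width > JpegMinWidth) then
    Result := '.jpg'   // large continuous-tone image: JPEG candidate
  else
    Result := '.bmp';  // small or palette-based: leave uncompressed
end;
```

A caller would pass ChooseImageExt(Bmp.PixelFormat, Bmp.Width) when building the output file name. Note that the SaveBitmap routine shown earlier only encodes JPEG and BMP, so a '.png' result needs its own encoding branch before it is useful.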

What BitmapCount does and does not count

PDF distinguishes between image XObjects and vector graphics drawn with path operators. A page that looks visually complex may return a BitmapCount of zero if every element is vector. Scanned pages almost always return exactly one: the scanner writes the entire scan as a single full-page image XObject at whatever resolution the scanner was set to. Pages that mix typeset text with embedded photographs return one entry per photograph. Decorative rule lines, shaded backgrounds, and table borders usually do not appear in the bitmap count at all.
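
A quick per-page survey makes these cases visible before any extraction runs. The sketch below uses only PageCount, PageNumber, and BitmapCount as described in this article; ReportImageCounts is a hypothetical name.

```pascal
// Assumes System.SysUtils and the unit declaring TPdf are in the uses
// clause, and that Pdf is already active.

// Hypothetical helper: prints how many image XObjects each page reports.
procedure ReportImageCounts(Pdf: TPdf);
var
  Page, Count: Integer;
begin
  for Page := 1 to Pdf.PageCount do
  begin
    Pdf.PageNumber := Page;   // move the cursor before reading the count
    Count := Pdf.BitmapCount;
    if Count = 0 then
      Writeln(Format('page %d: no image XObjects reported', [Page]))
    else
      Writeln(Format('page %d: %d image(s)', [Page, Count]));
  end;
end;
```

A run of pages that each report exactly one image is a strong hint that the document is a scan; long stretches of zeros usually mean typeset or vector content.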

The count also does not include inline images, a rarely used PDF construct where image data is embedded directly in the page content stream rather than as a named XObject. Those fall outside what this API surfaces; they are uncommon enough in real documents that most extraction tools simply do not handle them.

One detail worth keeping in mind: the BitmapCount you read is for the current page as of the last PageNumber assignment. If your code branches or calls any function that changes PageNumber between counting and fetching, you may read fewer images than you allocated space for, or index past the end. Keep the count read and the Bitmap[] loop on the same page without touching PageNumber in between.
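
One way to keep the count and the fetch on the same page is to confine both to a single routine that sets PageNumber once. The sketch below is an illustration only: FetchPageBitmaps is a hypothetical name, and it relies on TObjectList<T> from System.Generics.Collections, which needs a Delphi version with generics.

```pascal
// Assumes System.Generics.Collections, Vcl.Graphics and the unit declaring
// TPdf are in the uses clause.

// Hypothetical helper: returns all bitmaps of one page in an owning list.
// The caller frees the list, which in turn frees every bitmap in it.
function FetchPageBitmaps(Pdf: TPdf; Page: Integer): TObjectList<TBitmap>;
var
  Count, Idx: Integer;
  Bmp: TBitmap;
begin
  Result := TObjectList<TBitmap>.Create(True);  // True = list owns its items
  try
    Pdf.PageNumber := Page;     // position the cursor once
    Count := Pdf.BitmapCount;   // read the count once, into a local
    Result.Capacity := Count;
    for Idx := 0 to Count - 1 do
    begin
      Bmp := Pdf.Bitmap[Idx];   // nothing in this loop touches PageNumber
      if Assigned(Bmp) then
        Result.Add(Bmp);
    end;
  except
    Result.Free;                // do not leak the list if a fetch fails
    raise;
  end;
end;
```

Because nil returns are skipped, the list may hold fewer entries than BitmapCount reported; code that needs the original indices should store them alongside each bitmap.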

Using TPdfView in a form application

The TPdfView component exposes the same BitmapCount and Bitmap[] properties, but the page it reads from is the view's currently displayed page, not TPdf.PageNumber. The two page pointers are independent; setting one does not move the other. In a VCL form application with a live viewer, you can call Pdf.PageNumber := N to drive extraction through TPdf while the viewer stays on whatever the user last scrolled to. That separation is intentional and keeps the viewer's display state clean while a background extraction runs.

Memory and performance in batch jobs

Across a large archive the memory budget is the main thing to watch. Each Bitmap[] call allocates a new TBitmap on the heap, and on a 300 DPI scanned page that is easily 25 MB of raw pixel data before any encoding. If you process pages in a tight loop without freeing between iterations, the working set grows linearly with the number of images. The correct shape is always: fetch one bitmap, do what you need, free it, fetch the next. If you need to hold references to several bitmaps at once for a comparison step, count them first with BitmapCount and allocate your container accordingly, then free each one as soon as you are done with it rather than deferring to end-of-document cleanup. On a document with 500 scanned pages that distinction can mean the difference between 25 MB and 12 GB peak RSS.
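
The 25 MB and 12 GB figures for scanned pages can be checked with a little arithmetic. The sketch below assumes a US Letter page (8.5 by 11 inches) decoded to 24 bits per pixel, which the article does not state explicitly, and it ignores the few bytes of row padding a real bitmap adds. RawBitmapBytes is a hypothetical helper name.

```pascal
// Hypothetical helper: raw pixel bytes for a page scanned at a given DPI.
// Row padding is ignored, so real bitmaps are marginally larger.
function RawBitmapBytes(WidthInches, HeightInches: Double;
  Dpi, BitsPerPixel: Integer): Int64;
var
  WidthPx, HeightPx: Int64;
begin
  WidthPx := Round(WidthInches * Dpi);
  HeightPx := Round(HeightInches * Dpi);
  Result := WidthPx * HeightPx * (BitsPerPixel div 8);
end;

// RawBitmapBytes(8.5, 11, 300, 24) = 2550 * 3300 * 3 = 25245000 bytes
// 500 such pages held at once      = 12622500000 bytes, about 12.6 GB
```

At 32 bits per pixel the same page comes to 33660000 bytes, so pages decoded with an alpha channel push the totals higher still.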

The BitmapCount and Bitmap[] properties shown here are part of the PDFium VCL Component for Delphi and C++Builder.