Technical article

PDFlibPas: text, image, and font extraction in Delphi

This article covers text, image, and font extraction with PDFlibPas in Delphi, for teams working with Delphi, C++Builder, Lazarus/FPC, and losLab components.

The focus is on practical decisions, pitfalls, and checkpoints that keep the solution reliable in production use.

Architecture decisions

Separate extraction facts from interpretation. Settle the following points before writing extraction code:

  • whether output needs visual order, content-stream order, or search-oriented order
  • image extraction format, color conversion, compression retention, and naming
  • font subset naming, encoding diagnostics, and missing ToUnicode handling
  • confidence flags for OCR layers, hidden text, clipped content, and rotated pages
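
The decisions above can be captured in one small profile record so that every run states which choices were in effect. The following is a minimal sketch under stated assumptions: TExtractionProfile, its fields, and the naming pattern are hypothetical application-level names, not part of PDFlibPas.

```pascal
program ExtractionProfileSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Which ordering the text output should follow.
  TTextOrder = (toVisual, toContentStream, toSearch);

  // Whether images are exported as stored or converted on export.
  TImageExport = (ieOriginalData, ieConvertedRgb);

  // Hypothetical profile record; all names are illustrative assumptions.
  TExtractionProfile = record
    TextOrder: TTextOrder;
    ImageExport: TImageExport;
    ImageNamePattern: string;        // Format pattern: page number, image number
    ReportMissingToUnicode: Boolean;
    FlagHiddenText: Boolean;
    FlagRotatedPages: Boolean;
  end;

// A profile aimed at search indexing: keep original image data, flag everything.
function DefaultIndexingProfile: TExtractionProfile;
begin
  Result.TextOrder := toSearch;
  Result.ImageExport := ieOriginalData;
  Result.ImageNamePattern := 'page%.4d-img%.3d';
  Result.ReportMissingToUnicode := True;
  Result.FlagHiddenText := True;
  Result.FlagRotatedPages := True;
end;

// Builds a zero-padded base name so exported files sort by page and index.
function ImageBaseName(const Profile: TExtractionProfile;
  PageNo, ImageNo: Integer): string;
begin
  Result := Format(Profile.ImageNamePattern, [PageNo, ImageNo]);
end;

begin
  Writeln(ImageBaseName(DefaultIndexingProfile, 3, 12)); // page0003-img012
end.
```

Storing the profile name or version next to the output makes it possible to explain later why two runs over the same file produced different results.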

Implementation workflow

Preserve page and resource context. The order below keeps the workflow reviewable for Delphi and C++Builder teams.

  1. scan page resources and content streams while preserving object references
  2. extract text runs with coordinates, font identity, Unicode mapping, and style signals
  3. extract images with page location, dimensions, color space, and original object data when needed
  4. classify fonts by subset, embedded status, and encoding behavior
  5. produce an analysis report that distinguishes facts from inferred reading order

Validation evidence

Extraction evidence that remains explainable. Keep these fields with the output or support record.

  • page number, object reference, coordinates, decoded text, font, and confidence
  • image size, color space, compression, mask, and export filename
  • font subset name, embedded state, encoding map, and ToUnicode status
  • warnings for hidden, clipped, rotated, or overlapping content
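
One way to keep these fields together is a single record per text run that separates what was read from the file from what the workflow inferred. This is a sketch only: TTextRunEvidence, its field names, and the log-line layout are assumptions for illustration, and the sample values are invented.

```pascal
program EvidenceRecordSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TToUnicodeStatus = (tuPresent, tuMissing, tuUnknown);

  // Hypothetical evidence record; field names are illustrative.
  TTextRunEvidence = record
    // Facts read from the file
    PageNo: Integer;
    ObjectRef: string;          // for example '12 0 R'
    X, Y: Double;               // page coordinates in PDF units
    DecodedText: string;
    FontName: string;
    ToUnicode: TToUnicodeStatus;
    // Interpretation added by the workflow
    ReadingOrderIndex: Integer; // -1 while no order has been inferred
    Confidence: Double;         // 0.0 to 1.0, scale defined by the workflow
    Warnings: string;           // for example 'hidden;rotated'
  end;

const
  ToUnicodeNames: array[TToUnicodeStatus] of string =
    ('present', 'missing', 'unknown');

// Invented sample values, used only to exercise EvidenceLine.
function SampleRun: TTextRunEvidence;
begin
  Result.PageNo := 1;
  Result.ObjectRef := '12 0 R';
  Result.X := 72;
  Result.Y := 700.5;
  Result.DecodedText := 'Invoice';
  Result.FontName := 'ABCDEF+Helvetica';
  Result.ToUnicode := tuMissing;
  Result.ReadingOrderIndex := -1;
  Result.Confidence := 0.6;
  Result.Warnings := 'rotated';
end;

// Formats one evidence record as a single log line with a fixed decimal point.
function EvidenceLine(const E: TTextRunEvidence): string;
var
  FS: TFormatSettings;
begin
  FS := FormatSettings;
  FS.DecimalSeparator := '.';  // keep logs independent of the machine locale
  Result := Format(
    'page=%d obj=%s x=%.2f y=%.2f font=%s tounicode=%s order=%d conf=%.2f warn=%s text="%s"',
    [E.PageNo, E.ObjectRef, E.X, E.Y, E.FontName, ToUnicodeNames[E.ToUnicode],
     E.ReadingOrderIndex, E.Confidence, E.Warnings, E.DecodedText], FS);
end;

begin
  Writeln(EvidenceLine(SampleRun));
end.
```

Keeping the inferred fields in a clearly marked group makes it harder for a later reader to mistake a guessed reading order for something stored in the PDF.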

Extracted text is not always authored text

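OCR layers, hidden text, and custom encodings mean that extracted text can differ from what the author wrote or what a reader sees on the page. A professional extraction workflow should therefore record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.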

Support package design

Once PDFlibPas is deployed, the most valuable support package is one that explains the input, the profile, the output, and the exact stage that failed. It should contain:

  • the evidence fields already listed for text runs, images, fonts, and warnings
  • a terminology snapshot: text extraction, image extraction, font resource, ToUnicode

Practical context

Use these checkpoints to make sure the feature goes beyond a demo and can be justified during release, support, and customer escalations.

  • Define the use case before changing code
  • Verify the result with small test files
  • Leave API names and literal values unchanged

Edge cases

  • PDF drawing order may not equal human reading order
  • ligatures and custom encodings can make copied text differ from visible text
  • images may be masks, soft masks, or repeated resources rather than standalone pictures
  • OCR layers can contain stale or misaligned text over scanned pages
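
The first edge case can be made concrete with a small ordering heuristic. The sketch below assumes a hypothetical TTextRun record and invented coordinates; it keeps the content-stream index as a fact and returns an inferred visual order separately. The line bucketing is deliberately crude: it does not handle multi-column layouts or rotated pages, and runs close to a bucket boundary can land on different lines.

```pascal
program ReadingOrderSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record; the array index is the content-stream position (fact).
  TTextRun = record
    X, Y: Double;   // baseline origin in PDF units, page origin at bottom left
    Text: string;
  end;
  TTextRunArray = array of TTextRun;
  TIntArray = array of Integer;

function MakeRun(X, Y: Double; const Text: string): TTextRun;
begin
  Result.X := X;
  Result.Y := Y;
  Result.Text := Text;
end;

// Invented sample in content-stream order: World, Footer, Hello.
function SampleRuns: TTextRunArray;
begin
  SetLength(Result, 3);
  Result[0] := MakeRun(150, 700, 'World');
  Result[1] := MakeRun(72, 40, 'Footer');
  Result[2] := MakeRun(72, 700, 'Hello');
end;

// True when A should be read before B: higher line first, then left to right.
function ComesBefore(const A, B: TTextRun; LineHeight: Double): Boolean;
var
  LineA, LineB: Integer;
begin
  LineA := Round(A.Y / LineHeight);
  LineB := Round(B.Y / LineHeight);
  if LineA <> LineB then
    Result := LineA > LineB
  else
    Result := A.X < B.X;
end;

// Returns, for each run in stream order, its inferred visual position.
function VisualOrderOf(const Runs: TTextRunArray; LineHeight: Double): TIntArray;
var
  Order: TIntArray;
  I, J, Tmp: Integer;
begin
  SetLength(Order, Length(Runs));
  for I := 0 to High(Order) do
    Order[I] := I;
  // Insertion sort on indexes: stable and adequate for page-sized inputs.
  for I := 1 to High(Order) do
  begin
    Tmp := Order[I];
    J := I - 1;
    while (J >= 0) and ComesBefore(Runs[Tmp], Runs[Order[J]], LineHeight) do
    begin
      Order[J + 1] := Order[J];
      Dec(J);
    end;
    Order[J + 1] := Tmp;
  end;
  SetLength(Result, Length(Runs));
  for I := 0 to High(Order) do
    Result[Order[I]] := I;
end;

var
  Runs: TTextRunArray;
  Visual: TIntArray;
  I: Integer;
begin
  Runs := SampleRuns;
  Visual := VisualOrderOf(Runs, 12);
  for I := 0 to High(Runs) do
    Writeln(Format('stream=%d visual=%d text=%s', [I, Visual[I], Runs[I].Text]));
end.
```

In the sample, "Hello" is drawn last but comes first visually, which is the kind of difference the analysis report should present as an inference rather than a fact.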

Delphi / C++Builder notes

Place PDFlibPas behind a small service boundary that receives files, streams, profiles, and credentials, and returns output paths, warnings, metrics, and validation status. Key terms in this article are text extraction, image extraction, font resource, ToUnicode, content stream, and coordinates.
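
A minimal sketch of such a boundary is shown here, assuming hypothetical names throughout: IExtractionService, the request and result records, and the stub class are illustrative and not part of PDFlibPas. The stub only checks that the input file exists, so the program runs without the library; a real implementation would call PDFlibPas where the comment indicates.

```pascal
program ServiceBoundarySketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TTextList = array of string;

  // Everything the caller hands over; hypothetical field names.
  TExtractionRequest = record
    InputFile: string;
    OutputDir: string;
    ProfileName: string;
    Password: string;       // never write this value to logs
  end;

  // Everything the caller gets back.
  TExtractionResult = record
    Succeeded: Boolean;
    FailedStage: string;    // empty on success, for example 'load'
    OutputPaths: TTextList;
    Warnings: TTextList;
    ElapsedMs: Int64;
  end;

  IExtractionService = interface
    function Run(const Request: TExtractionRequest): TExtractionResult;
  end;

  // Stub used for tests; it performs no extraction.
  TStubExtractionService = class(TInterfacedObject, IExtractionService)
  public
    function Run(const Request: TExtractionRequest): TExtractionResult;
  end;

function TStubExtractionService.Run(
  const Request: TExtractionRequest): TExtractionResult;
begin
  Result.Succeeded := False;
  Result.FailedStage := '';
  SetLength(Result.OutputPaths, 0);
  SetLength(Result.Warnings, 0);
  Result.ElapsedMs := 0;
  if not FileExists(Request.InputFile) then
  begin
    Result.FailedStage := 'load';
    Exit;
  end;
  // A real implementation would load the file with PDFlibPas here and
  // fill OutputPaths, Warnings, and ElapsedMs.
  Result.Succeeded := True;
end;

function NewService: IExtractionService;
begin
  Result := TStubExtractionService.Create;
end;

function MakeRequest(const InputFile, OutputDir: string): TExtractionRequest;
begin
  Result.InputFile := InputFile;
  Result.OutputDir := OutputDir;
  Result.ProfileName := 'indexing';
  Result.Password := '';
end;

var
  Outcome: TExtractionResult;
begin
  Outcome := NewService.Run(MakeRequest('does-not-exist.pdf', 'out'));
  if Outcome.Succeeded then
    Writeln('ok')
  else
    Writeln('failed at stage: ', Outcome.FailedStage);
end.
```

Because callers depend only on the interface, the stub can stand in for the real service in unit tests, and the failed stage is reported in the same place a support package would look for it.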

Delphi code example

The following Delphi sketch shows a practical service boundary for this topic. Keep policy checks, logging, and validation outside the narrow product call so the workflow stays testable.

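// Sketch only: ExtractDocumentText, ExtractDocumentImages, BuildFontInventory,
// and the Save* routines are application-level helpers assumed by this
// article; they are not PDFlibPas functions. Exception requires SysUtils.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // LoadFromFile returns 1 on success; the empty string means no password.
    if Pdf.LoadFromFile(FileName, '') <> 1 then
      raise Exception.CreateFmt('Could not load "%s"', [FileName]);
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    Pdf.Free;
  end;
end;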

Production checklist

  • Run the workflow on an empty file, a normal customer file, and a worst-case file
  • Open the generated output with the target viewer, validator, printer, or downstream application
  • Log product version, profile version, input hash, output path, elapsed time, and warning count
  • Keep passwords, certificates, temporary files, and customer data under explicit retention rules
  • Add regression documents when a customer file exposes a new edge case

Product documentation

PDFlibPas

Additional code examples

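The first fragment lists and exports the images on page 1. Both fragments assume that Pdf is a TPDFlib instance with a document already loaded.

var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    // Image list items are indexed from 1, like the font indexes below.
    for I := 1 to Pdf.GetImageListCount(ImgList) do
    begin
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      // Writes the image data as stored; the .bin extension is a placeholder.
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    Pdf.ReleaseImageList(ImgList);
  end;
end;

The second fragment builds a simple font inventory for the same document.

var
  I: Integer;
begin
  Pdf.FindFonts;
  for I := 1 to Pdf.FontCount do        // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      Writeln(Format('%s  type=%d  embedded=%d  subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;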