PDFlibPas: text, image, and font extraction in Delphi

losLab PDF Library gives Delphi and C++Builder teams a source-available PDF engine for desktop, server, DLL, ActiveX, and Dylib workflows, with built-in PDF/A and PDF/UA checks, PAdES signatures, and a choice of renderer, without sending documents to an external PDF service.

This article is intended for teams building PDF analysis, migration, search, evidence capture, or support-inspection tools. It treats text, image, and font extraction as production document engineering rather than a single component call.

The practical risk is that extraction output is easy to over-trust, even though PDF content order, font encoding, image color spaces, and page resources rarely match the user-visible reading order exactly. The workflow therefore needs a written contract, observable diagnostics, and realistic regression files.

Architecture decisions

Separate extraction facts from interpretation. Settle the following decisions before writing extraction code:

whether output needs visual order, content-stream order, or search-oriented order
image extraction format, color conversion, compression retention, and naming
font subset naming, encoding diagnostics, and missing ToUnicode handling
confidence flags for OCR layers, hidden text, clipped content, and rotated pages
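
One way to make these decisions explicit is to record them in a versioned profile that travels with every extraction run. The following is a minimal sketch in plain Object Pascal. Every type, field, and default value in it (TExtractionProfile, TTextOrder, the image name pattern, and so on) is a hypothetical illustration and not part of PDFlibPas.

```pascal
program ExtractionProfileSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // All names below are hypothetical; none of them come from PDFlibPas.
  TTextOrder = (toContentStream, toVisual, toSearch);
  TImageExport = (ieOriginalData, ieConverted);
  TMissingToUnicode = (mtuWarn, mtuFail);

  TExtractionProfile = record
    Version: string;
    TextOrder: TTextOrder;
    ImageExport: TImageExport;
    ImageNamePattern: string;   // Format pattern: page number, image number
    MissingToUnicode: TMissingToUnicode;
    FlagOcrLayers: Boolean;
    FlagHiddenText: Boolean;
    FlagClippedContent: Boolean;
    FlagRotatedPages: Boolean;
  end;

const
  TextOrderNames: array[TTextOrder] of string =
    ('content-stream', 'visual', 'search');

// Conservative defaults: keep raw facts and flag everything doubtful.
function DefaultProfile: TExtractionProfile;
begin
  Result.Version := '1';
  Result.TextOrder := toContentStream;
  Result.ImageExport := ieOriginalData;
  Result.ImageNamePattern := 'page%.4d-img%.3d';
  Result.MissingToUnicode := mtuWarn;
  Result.FlagOcrLayers := True;
  Result.FlagHiddenText := True;
  Result.FlagClippedContent := True;
  Result.FlagRotatedPages := True;
end;

// One-line summary suitable for a log entry or a report header.
function DescribeProfile(const P: TExtractionProfile): string;
begin
  Result := Format('profile=%s order=%s image=%s',
    [P.Version, TextOrderNames[P.TextOrder],
     Format(P.ImageNamePattern, [1, 1])]);
end;

begin
  Writeln(DescribeProfile(DefaultProfile));
end.
```

Logging the profile summary next to each result makes it possible to tell later which ordering and naming rules produced a given output.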

Implementation flow

Preserve page and resource context. The following order keeps the workflow readable for Delphi and C++Builder teams:

scan page resources and content streams while preserving object references
extract text runs with coordinates, font identity, Unicode mapping, and style signals
extract images with page location, dimensions, color space, and original object data when needed
classify fonts by subset, embedded status, and encoding behavior
produce an analysis report that distinguishes facts from inferred reading order
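
The last step, keeping facts apart from inferred reading order, can be sketched with a record that stores both. In this hypothetical example the content-stream index and coordinates are treated as facts, while VisualRank is a guess computed by a deliberately simple heuristic (top to bottom, then left to right, with lines bucketed by an assumed line height). The record layout and the heuristic are illustrative assumptions, not the library's behavior, and multi-column or rotated pages would need a better rule.

```pascal
program ReadingOrderSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record; the field names are illustrative only.
  TTextRun = record
    StreamIndex: Integer;  // fact: position in the content stream
    X, Y: Double;          // fact: baseline origin in PDF user space (Y grows upward)
    Text: string;          // fact: decoded text
    VisualRank: Integer;   // inference: 0-based position in the guessed reading order
  end;
  TTextRuns = array of TTextRun;

function MakeRun(AStreamIndex: Integer; AX, AY: Double;
  const AText: string): TTextRun;
begin
  Result.StreamIndex := AStreamIndex;
  Result.X := AX;
  Result.Y := AY;
  Result.Text := AText;
  Result.VisualRank := -1;  // not inferred yet
end;

// True when A is read before B: higher line first, then left to right,
// then content-stream order as the final tie-breaker.
function ReadsBefore(const A, B: TTextRun; LineHeight: Double): Boolean;
var
  LineA, LineB: Int64;
begin
  LineA := Round(A.Y / LineHeight);
  LineB := Round(B.Y / LineHeight);
  if LineA <> LineB then
    Result := LineA > LineB
  else if A.X <> B.X then
    Result := A.X < B.X
  else
    Result := A.StreamIndex < B.StreamIndex;
end;

// Fills VisualRank without reordering the array, so stream order is preserved.
procedure AssignVisualRanks(var Runs: TTextRuns; LineHeight: Double);
var
  I, J: Integer;
begin
  for I := 0 to High(Runs) do
  begin
    Runs[I].VisualRank := 0;
    for J := 0 to High(Runs) do
      if (J <> I) and ReadsBefore(Runs[J], Runs[I], LineHeight) then
        Inc(Runs[I].VisualRank);
  end;
end;

var
  Runs: TTextRuns;
  K: Integer;
begin
  SetLength(Runs, 3);
  Runs[0] := MakeRun(0, 120, 700, 'World');   // drawn first, read second
  Runs[1] := MakeRun(1, 72, 40, 'Footer');
  Runs[2] := MakeRun(2, 72, 700, 'Hello');    // drawn last, read first
  AssignVisualRanks(Runs, 12.0);
  for K := 0 to High(Runs) do
    Writeln(Format('stream=%d rank=%d %s',
      [Runs[K].StreamIndex, Runs[K].VisualRank, Runs[K].Text]));
end.
```

Because the array is never re-sorted, a report can show both orders side by side and a reviewer can see where the heuristic disagreed with the drawing order.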

Validation evidence

Extraction evidence should remain explainable. Keep these fields with the result or support record:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
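
The font subset name field can often be derived from the font name itself. The PDF specification describes subset fonts as carrying a tag of six uppercase letters followed by a plus sign in front of the PostScript name, for example ABCDEF+Helvetica. The sketch below parses that prefix in plain Object Pascal. The function names are hypothetical, and it assumes the font name handed to it still contains the tag, which is worth confirming for the API you use.

```pascal
program SubsetFontNameSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Returns the six-letter subset tag, or '' when the name has no subset prefix.
function SubsetTag(const FontName: string): string;
var
  I: Integer;
begin
  Result := '';
  // Shortest valid form: six letters, '+', and at least one name character.
  if (Length(FontName) < 8) or (FontName[7] <> '+') then
    Exit;
  for I := 1 to 6 do
    if (FontName[I] < 'A') or (FontName[I] > 'Z') then
      Exit;
  Result := Copy(FontName, 1, 6);
end;

// Strips the subset prefix so that several subsets of one font can be grouped.
function BaseFontName(const FontName: string): string;
begin
  if SubsetTag(FontName) <> '' then
    Result := Copy(FontName, 8, MaxInt)
  else
    Result := FontName;
end;

procedure Report(const FontName: string);
begin
  Writeln(Format('%s  subset=%s  base=%s',
    [FontName, BoolToStr(SubsetTag(FontName) <> '', True),
     BaseFontName(FontName)]));
end;

begin
  Report('ABCDEF+Helvetica');
  Report('Helvetica');
  Report('Abcdef+Helvetica');  // lowercase letters: not a valid subset tag
end.
```

Storing both the tag and the base name lets a report group repeated subsets of the same typeface while still pointing at the exact font resource each text run used.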

Extracted text is not always authored text

A professional extraction workflow should record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.

Support package design

Once PDFlibPas is deployed, the most valuable support package is the one that explains the input, the profile, the output, and the exact stage that failed. Include at least:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
terminology snapshot: text extraction, image extraction, font resource, ToUnicode

Engineering review notes for text, image, and font extraction

Use these review notes to confirm that the feature has moved beyond a demo and can be defended during release, support, and customer escalation.

Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation focus: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation focus: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation focus: classify fonts by subset, embedded status, and encoding behavior. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.
Decision: confidence flags for OCR layers, hidden text, clipped content, and rotated pages. Implementation focus: produce an analysis report that distinguishes facts from inferred reading order. Acceptance evidence: image size, color space, compression, mask, and export filename. Regression trigger: images may be masks, soft masks, or repeated resources rather than standalone pictures.
Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation focus: scan page resources and content streams while preserving object references. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation focus: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation focus: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.

Edge cases

PDF drawing order may not equal human reading order
ligatures and custom encodings can make copied text differ from visible text
images may be masks, soft masks, or repeated resources rather than standalone pictures
OCR layers can contain stale or misaligned text over scanned pages
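
The ligature case is the easiest of these to illustrate. When a font maps a glyph to one of the Unicode Latin ligature code points (U+FB00 to U+FB06), a search for "file" will not match extracted text that contains U+FB01 followed by "le". The sketch below expands those seven code points to their letter sequences and reports how many replacements were made, so the difference can be logged as a warning instead of being silently hidden. It is a hand-written table in plain Object Pascal, not a full Unicode normalization, and the function name is hypothetical.

```pascal
program LigatureExpansionSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Expands the Latin ligatures U+FB00..U+FB06 into plain letters.
// Replaced receives the number of ligature characters that were expanded.
function ExpandLigatures(const S: UnicodeString;
  out Replaced: Integer): UnicodeString;
var
  I: Integer;
begin
  Result := '';
  Replaced := 0;
  for I := 1 to Length(S) do
  begin
    Inc(Replaced);  // assume a ligature; undone in the else branch
    case Ord(S[I]) of
      $FB00: Result := Result + 'ff';
      $FB01: Result := Result + 'fi';
      $FB02: Result := Result + 'fl';
      $FB03: Result := Result + 'ffi';
      $FB04: Result := Result + 'ffl';
      $FB05, $FB06: Result := Result + 'st';
    else
      begin
        Dec(Replaced);
        Result := Result + S[I];
      end;
    end;
  end;
end;

var
  Sample, Plain: UnicodeString;
  Count: Integer;
begin
  Sample := #$FB01'le of'#$FB03'ce';   // "file office" written with ligatures
  Plain := ExpandLigatures(Sample, Count);
  Writeln(Plain, ' replaced=', Count);
end.
```

Keeping the original run alongside the expanded form preserves the fact (what the PDF encoded) while the expanded form serves search.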

Delphi / C++Builder notes

PDFlibPas should sit behind a small service boundary that receives files, streams, profiles, and credentials, then returns output paths, warnings, metrics, and validation status. Important terms include text extraction, image extraction, font resource, ToUnicode, content stream, and coordinates.

Delphi code example

The following Delphi sketch shows a practical service boundary for this topic. Keep policy checks, logging, and validation outside the narrow block of product calls so the workflow remains testable.

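// ExtractDocumentText, ExtractDocumentImages, BuildFontInventory, and the
// Save* routines are application-defined helpers, not library calls.
// Exception comes from SysUtils.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // LoadFromFile is expected to return 1 on success; confirm this for
    // your library version and stop early instead of extracting nothing.
    if Pdf.LoadFromFile(FileName, '') <> 1 then
      raise Exception.CreateFmt('Cannot open PDF: %s', [FileName]);
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    Pdf.Free;  // release the document even when extraction raises
  end;
end;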

Production checklist

Run the workflow on an empty file, an ordinary customer file, and a worst-case file
Open the generated PDF file in the target viewer, validator, printer, or downstream application
Log the product version, profile version, input hash, output path, execution time, and warning count
Store passwords, certificates, temporary files, and customer data under explicit retention rules
Add a regression document whenever a customer file reveals a new edge case

Product documentation

PDFlibPas

Additional code examples

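// Saves the raw data of every image on page 1 of an already loaded document.
procedure DumpPageImages(Pdf: TPDFlib);
var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    // Image list items are assumed to be indexed from 1, as font indexes are;
    // confirm this against the reference for your library version.
    for I := 1 to Pdf.GetImageListCount(ImgList) do
    begin
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    Pdf.ReleaseImageList(ImgList);  // free the list even if saving fails
  end;
end;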

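// Lists each font in an already loaded document with its embedding state.
procedure ListFonts(Pdf: TPDFlib);
var
  I: Integer;
begin
  Pdf.FindFonts;                        // scan the document before reading FontCount
  for I := 1 to Pdf.FontCount do        // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      Writeln(Format('%s  type=%d  embedded=%d  subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;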