PDFlibPas: text, image, and font extraction in Delphi

losLab PDF Library gives Delphi and C++Builder teams a source-code PDF engine for desktop, server, DLL, ActiveX, and Dylib workflows. It offers built-in PDF/A and PDF/UA checks, PAdES signing, and renderer selection without sending documents to an external PDF service.

This article is written for teams building PDF analysis, migration, search, evidence capture, or support-inspection tools. It treats text, image, and font extraction as production-grade document engineering rather than as a single component call.

The practical risk is that extraction output is easy to over-trust, even though PDF content order, font encoding, image color spaces, and page resources rarely match user-visible reading order exactly. The workflow therefore needs a written contract, observable diagnostics, and realistic regression files.

Architecture decisions

Separate extraction facts from interpretation. Settle these decisions before writing extraction code:

whether output needs visual order, content-stream order, or search-oriented order
image extraction format, color conversion, compression retention, and naming
font subset naming, encoding diagnostics, and missing ToUnicode handling
confidence flags for OCR layers, hidden text, clipped content, and rotated pages
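
One way to keep these decisions explicit is to store them in a small profile record that travels with every extraction job. The sketch below is only an illustration: every type, field, and function name in it is hypothetical and none of them belongs to PDFlibPas. It assumes Delphi or Free Pascal in Delphi mode.

```pascal
program ExtractionProfileSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical types; none of them come from PDFlibPas.
  TTextOrder = (toContentStream, toVisual, toSearch);
  TImageExport = (ieOriginalData, ieConvertedRgb);
  TMissingToUnicode = (mtFlagOnly, mtSkipRun);

  TExtractionProfile = record
    TextOrder: TTextOrder;
    ImageExport: TImageExport;
    ImageNamePattern: string;            // Format pattern: page number, image index
    MissingToUnicode: TMissingToUnicode;
    FlagHiddenText: Boolean;
    FlagRotatedPages: Boolean;
  end;

// Example profile for a search-indexing job.
function IndexingProfile: TExtractionProfile;
begin
  Result.TextOrder := toSearch;
  Result.ImageExport := ieOriginalData;
  Result.ImageNamePattern := 'page%.4d-img%.3d';
  Result.MissingToUnicode := mtFlagOnly;
  Result.FlagHiddenText := True;
  Result.FlagRotatedPages := True;
end;

// Builds a zero-padded base name so exported images sort by page and index.
function ImageBaseName(const Profile: TExtractionProfile;
  PageNo, ImageIndex: Integer): string;
begin
  Result := Format(Profile.ImageNamePattern, [PageNo, ImageIndex]);
end;

begin
  Writeln(ImageBaseName(IndexingProfile, 3, 12));
end.
```

Keeping the profile in one record makes it straightforward to log the exact settings next to the output they produced.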

Implementation flow

Preserve page and resource context. The sequence below keeps the workflow reviewable for Delphi and C++Builder teams:

scan page resources and content streams while preserving object references
extract text runs with coordinates, font identity, Unicode mapping, and style signals
extract images with page location, dimensions, color space, and original object data when needed
classify fonts by subset, embedded status, and encoding behavior
produce an analysis report that distinguishes facts from inferred reading order
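
The last step, separating facts from inferred reading order, can be sketched as follows. The record keeps the content-stream index and coordinates as facts, and a separate function derives a visual order from them. The record, the function names, and the line-tolerance heuristic are all assumptions made for illustration; this is not how PDFlibPas orders text, and real pages with columns or rotation need more than this.

```pascal
program ReadingOrderSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical record; PDFlibPas does not define it.
  TTextRun = record
    StreamIndex: Integer;  // fact: position in the content stream
    X, Y: Double;          // fact: baseline origin in PDF user space
    Text: string;          // fact: decoded text
  end;
  TTextRunArray = array of TTextRun;

function MakeRun(Index: Integer; PosX, PosY: Double;
  const Value: string): TTextRun;
begin
  Result.StreamIndex := Index;
  Result.X := PosX;
  Result.Y := PosY;
  Result.Text := Value;
end;

// True when A should be read before B: higher lines first, then left to right.
// Baselines closer than LineTolerance are treated as the same line.
function ReadsBefore(const A, B: TTextRun; LineTolerance: Double): Boolean;
begin
  if Abs(A.Y - B.Y) > LineTolerance then
    Result := A.Y > B.Y          // PDF user space: Y grows upward
  else
    Result := A.X < B.X;
end;

// Returns a sorted copy (insertion sort); the input keeps content-stream order.
function InferVisualOrder(const Runs: TTextRunArray;
  LineTolerance: Double): TTextRunArray;
var
  I, J: Integer;
  Key: TTextRun;
begin
  Result := Copy(Runs, 0, Length(Runs));
  for I := 1 to High(Result) do
  begin
    Key := Result[I];
    J := I - 1;
    while (J >= 0) and ReadsBefore(Key, Result[J], LineTolerance) do
    begin
      Result[J + 1] := Result[J];
      Dec(J);
    end;
    Result[J + 1] := Key;
  end;
end;

var
  Runs, Ordered: TTextRunArray;
  I: Integer;
begin
  SetLength(Runs, 3);
  Runs[0] := MakeRun(0, 72, 100, 'footer');
  Runs[1] := MakeRun(1, 300, 700, 'World');
  Runs[2] := MakeRun(2, 72, 700.5, 'Hello');
  Ordered := InferVisualOrder(Runs, 2.0);
  for I := 0 to High(Ordered) do
    Writeln(Ordered[I].StreamIndex, ' ', Ordered[I].Text);
end.
```

Because each run keeps its StreamIndex, a report can show both orders side by side and label the visual one as inferred.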

Validation evidence

Extraction evidence that remains explainable. Store these fields alongside the output or the support record:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
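
As a rough illustration, the text-run fields above could be held in one record and written as a tab-separated line per run. The record layout, the field names, and the line format are assumptions chosen for this sketch, not a PDFlibPas structure.

```pascal
program EvidenceRecordSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record mirroring the text-run fields listed above.
  TTextEvidence = record
    PageNo: Integer;
    ObjectRef: string;     // indirect object reference, e.g. '12 0 R'
    X, Y: Double;          // PDF user-space coordinates
    DecodedText: string;
    FontName: string;
    HasToUnicode: Boolean;
    Confidence: Double;    // 0.0 (guess) .. 1.0 (direct mapping)
  end;

// Locale-independent fixed-point formatting; Str always writes '.'.
function Fixed2(Value: Double): string;
begin
  Str(Value:0:2, Result);
end;

// One tab-separated line per text run, suitable for appending to a log.
function EvidenceLine(const E: TTextEvidence): string;
const
  Tab = #9;
  YesNo: array[Boolean] of string = ('no', 'yes');
begin
  Result := IntToStr(E.PageNo) + Tab + E.ObjectRef + Tab +
    Fixed2(E.X) + Tab + Fixed2(E.Y) + Tab +
    StringReplace(E.DecodedText, Tab, ' ', [rfReplaceAll]) + Tab +
    E.FontName + Tab + 'tounicode=' + YesNo[E.HasToUnicode] + Tab +
    'conf=' + Fixed2(E.Confidence);
end;

var
  E: TTextEvidence;
begin
  E.PageNo := 1;
  E.ObjectRef := '12 0 R';
  E.X := 72;
  E.Y := 700.5;
  E.DecodedText := 'Invoice 2024';
  E.FontName := 'ABCDEF+Helvetica';
  E.HasToUnicode := False;
  E.Confidence := 0.4;
  Writeln(EvidenceLine(E));
end.
```

Tabs inside the decoded text are replaced with spaces so that each record stays on one parseable line.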

Extracted text is not always authored text

A professional extraction workflow should record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.

Support package design

Once PDFlibPas is deployed, the most valuable support package is the one that explains the input, the profile, the output, and the exact stage that failed. Include at least:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
terminology snapshot: text extraction, image extraction, font resource, ToUnicode

Engineering review notes: text, image, and font extraction

Use these review notes to verify that the feature goes beyond a demo and remains defensible during release, support, and customer escalation.

Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation pressure point: classify fonts by subset, embedded status, and encoding behavior. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.
Decision: confidence flags for OCR layers, hidden text, clipped content, and rotated pages. Implementation pressure point: produce an analysis report that distinguishes facts from inferred reading order. Acceptance evidence: image size, color space, compression, mask, and export filename. Regression trigger: images may be masks, soft masks, or repeated resources rather than standalone pictures.
Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation pressure point: scan page resources and content streams while preserving object references. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.

Edge cases

PDF drawing order may not equal human reading order
ligatures and custom encodings can make copied text differ from visible text
images may be masks, soft masks, or repeated resources rather than standalone pictures
OCR layers can contain stale or misaligned text over scanned pages
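
The ligature and custom-encoding case can be partly detected from the decoded text alone. The sketch below scans a decoded string for code points that often signal a lossy or font-specific mapping: the replacement character U+FFFD, the Basic Multilingual Plane private use area, and the Latin ligature presentation forms U+FB00 to U+FB06. The warning names and the choice of ranges are assumptions for illustration, and a clean result does not prove the text matches what is visible on the page.

```pascal
program DecodedTextCheckSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical warning flags; not part of PDFlibPas.
  TTextWarning = (twReplacementChar, twPrivateUse, twLigature, twControlChar);
  TTextWarnings = set of TTextWarning;

// Scans UTF-16 code units; supplementary-plane private use is not covered.
function InspectDecodedText(const S: UnicodeString): TTextWarnings;
var
  I, Code: Integer;
begin
  Result := [];
  for I := 1 to Length(S) do
  begin
    Code := Ord(S[I]);
    if Code = $FFFD then
      Include(Result, twReplacementChar)
    else if (Code >= $E000) and (Code <= $F8FF) then
      Include(Result, twPrivateUse)
    else if (Code >= $FB00) and (Code <= $FB06) then
      Include(Result, twLigature)
    else if (Code < 32) and (Code <> 9) and (Code <> 10) and (Code <> 13) then
      Include(Result, twControlChar);
  end;
end;

var
  S: UnicodeString;
  W: TTextWarnings;
begin
  S := 'of?ce';
  S[3] := WideChar($FB01);   // U+FB01 LATIN SMALL LIGATURE FI
  W := InspectDecodedText(S);
  Writeln('ligature: ', twLigature in W);
  Writeln('private use: ', twPrivateUse in W);
end.
```

A search index might normalize the flagged ligature to "fi" while the analysis report keeps the original code point as the recorded fact.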

Delphi / C++Builder notes

PDFlibPas should sit behind a small service boundary that receives files, streams, profiles, and credentials, then returns output paths, warnings, metrics, and validation status. Important terms include text extraction, image extraction, font resource, ToUnicode, content stream, and coordinates.

Delphi code example

The Delphi sketch below shows a practical service boundary for this topic. Keep policy checks, logging, and validation outside the narrow product-call section so that the flow stays testable.

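// ExtractDocumentText, ExtractDocumentImages, BuildFontInventory, and the
// Save* routines are not shown here; they are assumed to be application code.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // Empty password. LoadFromFile is assumed to return 1 on success;
    // confirm this against the PDFlibPas reference.
    if Pdf.LoadFromFile(FileName, '') <> 1 then
      raise Exception.CreateFmt('Cannot open PDF: %s', [FileName]);
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    Pdf.Free;  // release the engine even if a step raises
  end;
end;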

Production checklist

Run the workflow on an empty file, a typical customer file, and a worst-case file
Open the generated PDF with the target viewer, validator, printer, or downstream application
Record the product version, profile version, input hash, output path, elapsed time, and warning count
Keep passwords, certificates, temporary files, and customer data under explicit retention rules
Add regression documents whenever a customer file exposes a new edge case

Product documentation

PDFlibPas

Additional code examples

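// Export the raw data of every image on page 1.
// Assumes Pdf is a TPDFlib instance with a document already loaded.
var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    // Image list items are assumed to be 1-based, like font indexes;
    // confirm this against the PDFlibPas reference for your version.
    for I := 1 to Pdf.GetImageListCount(ImgList) do
    begin
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    Pdf.ReleaseImageList(ImgList);  // free the list even if saving fails
  end;
end;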

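// List each font in the loaded document with its type and embedding state.
// Assumes Pdf is a TPDFlib instance with a document already loaded.
var
  I: Integer;
begin
  Pdf.FindFonts;                        // scan the document for fonts
  for I := 1 to Pdf.FontCount do        // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      Writeln(Format('%s  type=%d  embedded=%d  subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;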