PDFlibPas: Text, image, and font extraction in Delphi

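losLab PDF Library provides a PDF engine, delivered with source code, for Delphi and C++Builder teams. It can be used in desktop, server, DLL, ActiveX, and Dylib workflows, and it offers PDF/A and PDF/UA checks, PAdES signatures, and multiple renderers without relying on an external PDF service.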

This article is for teams building PDF analysis, migration, search, evidence capture, or support-inspection tools. It treats text, image, and font extraction as production document engineering rather than as a single component call.

The practical risk is that extraction output is easy to over-trust, even though PDF content order, font encoding, image color spaces, and page resources rarely match the user-visible reading order exactly. Extraction therefore needs clear contracts, observable diagnostics, and regression samples that resemble real customer files.

Architecture decisions

Separate extraction facts from interpretation. Settle the following decisions before implementation, because each one changes what the output means:

whether output needs visual order, content-stream order, or search-oriented order
image extraction format, color conversion, compression retention, and naming
font subset naming, encoding diagnostics, and missing ToUnicode handling
confidence flags for OCR layers, hidden text, clipped content, and rotated pages
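
One way to keep these decisions explicit is to hold them in a single profile record that travels with every extraction run. The sketch below is a minimal illustration, not part of PDFlibPas: all type, field, and function names (TExtractionProfile, DefaultIndexingProfile, DescribeProfile, and the enumerations) are hypothetical and exist only for this example.

```pascal
program ExtractionProfileSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Which ordering the caller expects for extracted text.
  TTextOrder = (toContentStream, toVisual, toSearch);
  // Whether images keep their stored data or are converted on export.
  TImageExport = (ieOriginalData, ieConverted);
  // What to do when a font has no ToUnicode map.
  TMissingToUnicode = (mtuReportOnly, mtuGuessFromEncoding);
  // Conditions that should lower confidence in a piece of output.
  TConfidenceFlag = (cfOcrLayer, cfHiddenText, cfClipped, cfRotated);
  TConfidenceFlags = set of TConfidenceFlag;

  // Hypothetical profile record: one field per decision listed above.
  TExtractionProfile = record
    TextOrder: TTextOrder;
    ImageExport: TImageExport;
    ImageNamePattern: string;
    MissingToUnicode: TMissingToUnicode;
    ReportFlags: TConfidenceFlags;
  end;

const
  TextOrderNames: array[TTextOrder] of string =
    ('content-stream', 'visual', 'search');
  ImageExportNames: array[TImageExport] of string =
    ('original-data', 'converted');

// A conservative default for search indexing: report problems, guess nothing.
function DefaultIndexingProfile: TExtractionProfile;
begin
  Result.TextOrder := toSearch;
  Result.ImageExport := ieOriginalData;
  Result.ImageNamePattern := 'p%d_obj%d';
  Result.MissingToUnicode := mtuReportOnly;
  Result.ReportFlags := [cfOcrLayer, cfHiddenText, cfClipped, cfRotated];
end;

// One-line summary suitable for a log or support record.
function DescribeProfile(const P: TExtractionProfile): string;
begin
  Result := Format('order=%s images=%s',
    [TextOrderNames[P.TextOrder], ImageExportNames[P.ImageExport]]);
end;

begin
  // Prints: order=search images=original-data
  Writeln(DescribeProfile(DefaultIndexingProfile()));
end.
```

Logging the profile summary next to each output makes it possible to tell later which ordering and image policy produced a given result.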

Implementation flow

Preserve page and resource context. The order below keeps the workflow reviewable for Delphi and C++Builder teams.

scan page resources and content streams while preserving object references
extract text runs with coordinates, font identity, Unicode mapping, and style signals
extract images with page location, dimensions, color space, and original object data when needed
classify fonts by subset, embedded status, and encoding behavior
produce an analysis report that distinguishes facts from inferred reading order
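
The last step, keeping facts apart from inferred reading order, can be sketched as follows. Each text run keeps its content-stream index as a recorded fact, and a visual order is derived separately by a simple heuristic (top to bottom, then left to right, with a tolerance for runs on the same line). This is an illustrative assumption about how ordering might be inferred, not how PDFlibPas does it; TTextRun, ReadsBefore, and InferVisualOrder are hypothetical names. The tolerance comparison is deliberately simple and will misorder multi-column layouts.

```pascal
program ReadingOrderSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Facts recorded at extraction time; nothing here is inferred.
  TTextRun = record
    PageNumber: Integer;
    StreamIndex: Integer;  // position of the run in the content stream
    X, Y: Double;          // run origin in PDF user space (Y grows upward)
    Text: string;
  end;
  TTextRunArray = array of TTextRun;

function MakeRun(Page, StreamIndex: Integer; X, Y: Double;
  const Text: string): TTextRun;
begin
  Result.PageNumber := Page;
  Result.StreamIndex := StreamIndex;
  Result.X := X;
  Result.Y := Y;
  Result.Text := Text;
end;

// True when A should be read before B under the heuristic.
function ReadsBefore(const A, B: TTextRun; LineTolerance: Double): Boolean;
begin
  if A.PageNumber <> B.PageNumber then
    Result := A.PageNumber < B.PageNumber
  else if Abs(A.Y - B.Y) > LineTolerance then
    Result := A.Y > B.Y                      // higher on the page comes first
  else if A.X <> B.X then
    Result := A.X < B.X                      // same line: left to right
  else
    Result := A.StreamIndex < B.StreamIndex; // tie: fall back to stream order
end;

// Returns a reordered copy (insertion sort); the input array is unchanged.
function InferVisualOrder(const Runs: TTextRunArray;
  LineTolerance: Double): TTextRunArray;
var
  I, J: Integer;
  Key: TTextRun;
begin
  Result := Copy(Runs, 0, Length(Runs));
  for I := 1 to High(Result) do
  begin
    Key := Result[I];
    J := I - 1;
    while (J >= 0) and ReadsBefore(Key, Result[J], LineTolerance) do
    begin
      Result[J + 1] := Result[J];
      Dec(J);
    end;
    Result[J + 1] := Key;
  end;
end;

var
  Runs, Ordered: TTextRunArray;
  I: Integer;
begin
  SetLength(Runs, 3);
  Runs[0] := MakeRun(1, 0, 200.0, 700.0, 'World');  // drawn first
  Runs[1] := MakeRun(1, 1, 72.0, 700.5, 'Hello');   // drawn second, same line
  Runs[2] := MakeRun(1, 2, 72.0, 50.0, 'Footer');
  Ordered := InferVisualOrder(Runs, 2.0);
  // Prints Hello, World, Footer, each with its original stream index.
  for I := 0 to High(Ordered) do
    Writeln(Format('stream=%d text=%s', [Ordered[I].StreamIndex, Ordered[I].Text]));
end.
```

Because the stream index survives the reordering, a report can show both orders side by side and label the visual one as inferred.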

Validation evidence

Extraction evidence should remain explainable. Keep these fields with the output or support record.

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
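
As an illustration of the font fields, the sketch below splits a font name into subset tag and family name. It relies on the PDF convention that a subset font's name begins with a tag of exactly six uppercase letters followed by a plus sign. The record and function names are hypothetical, and the embedded and ToUnicode flags are simply passed in, on the assumption that the extraction layer has already determined them.

```pascal
program FontSubsetSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical evidence record for one font resource.
  TFontEvidence = record
    BaseFont: string;     // name exactly as found in the PDF
    SubsetTag: string;    // empty when the name carries no subset tag
    FamilyName: string;   // name with any subset tag removed
    IsEmbedded: Boolean;
    HasToUnicode: Boolean;
  end;

// True when the name starts with six uppercase letters and '+'.
function IsSubsetFontName(const BaseFont: string): Boolean;
var
  I: Integer;
begin
  Result := False;
  if (Length(BaseFont) < 8) or (BaseFont[7] <> '+') then
    Exit;
  for I := 1 to 6 do
    if (BaseFont[I] < 'A') or (BaseFont[I] > 'Z') then
      Exit;
  Result := True;
end;

function MakeFontEvidence(const BaseFont: string;
  IsEmbedded, HasToUnicode: Boolean): TFontEvidence;
begin
  Result.BaseFont := BaseFont;
  Result.IsEmbedded := IsEmbedded;
  Result.HasToUnicode := HasToUnicode;
  if IsSubsetFontName(BaseFont) then
  begin
    Result.SubsetTag := Copy(BaseFont, 1, 6);
    Result.FamilyName := Copy(BaseFont, 8, MaxInt);
  end
  else
  begin
    Result.SubsetTag := '';
    Result.FamilyName := BaseFont;
  end;
end;

// One-line summary for a font inventory or support record.
function DescribeFont(const F: TFontEvidence): string;
const
  YesNo: array[Boolean] of string = ('no', 'yes');
begin
  Result := Format('%s subset=%s embedded=%s tounicode=%s',
    [F.FamilyName, YesNo[F.SubsetTag <> ''], YesNo[F.IsEmbedded],
     YesNo[F.HasToUnicode]]);
end;

begin
  // Prints: Helvetica subset=yes embedded=yes tounicode=no
  Writeln(DescribeFont(MakeFontEvidence('ABCDEF+Helvetica', True, False)));
  // Prints: Times-Roman subset=no embedded=no tounicode=no
  Writeln(DescribeFont(MakeFontEvidence('Times-Roman', False, False)));
end.
```

A font that is a subset and lacks a ToUnicode map is a useful combination to flag, since text decoded from it is more likely to differ from what the page shows.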

Extracted text is not always authored text

A professional extraction workflow should record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.

Support package design

Once PDFlibPas is deployed, the most valuable support package is the one that explains the input, the profile, the output, and the exact stage that failed. It should carry the same evidence fields that are kept with the extraction output (page and object references with coordinates, decoded text, font, and confidence; image properties and export filename; font subset, embedding, encoding, and ToUnicode status; warnings for hidden, clipped, rotated, or overlapping content), plus a terminology snapshot covering text extraction, image extraction, font resource, and ToUnicode.
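
To capture the exact stage that failed, each stage can be run through a small wrapper that turns an exception into a recorded outcome. The following is a minimal sketch under stated assumptions: the stage procedures are stubs, and TStageOutcome, TStageProc, RunStage, and DescribeOutcome are hypothetical names rather than PDFlibPas API.

```pascal
program StageReportSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TStageStatus = (ssOk, ssFailed);

  // What a support record needs to know about one stage of a run.
  TStageOutcome = record
    StageName: string;
    Status: TStageStatus;
    ErrorText: string;  // exception class and message when Status = ssFailed
  end;

  // Signature shared by all stages in this sketch.
  TStageProc = procedure(const FileName, OutputDir: string);

// Runs one stage and converts any exception into a recorded outcome.
function RunStage(const StageName: string; Stage: TStageProc;
  const FileName, OutputDir: string): TStageOutcome;
begin
  Result.StageName := StageName;
  Result.Status := ssOk;
  Result.ErrorText := '';
  try
    Stage(FileName, OutputDir);
  except
    on E: Exception do
    begin
      Result.Status := ssFailed;
      Result.ErrorText := E.ClassName + ': ' + E.Message;
    end;
  end;
end;

// Stand-in stages; real ones would call the extraction code.
procedure TextStageStub(const FileName, OutputDir: string);
begin
  // Succeeds without doing anything.
end;

procedure ImageStageStub(const FileName, OutputDir: string);
begin
  raise Exception.Create('unsupported color space');
end;

function DescribeOutcome(const O: TStageOutcome): string;
begin
  if O.Status = ssOk then
    Result := O.StageName + ': ok'
  else
    Result := O.StageName + ': FAILED (' + O.ErrorText + ')';
end;

begin
  // Prints: text: ok
  Writeln(DescribeOutcome(RunStage('text', TextStageStub, 'sample.pdf', 'out')));
  // Prints: images: FAILED (Exception: unsupported color space)
  Writeln(DescribeOutcome(RunStage('images', ImageStageStub, 'sample.pdf', 'out')));
end.
```

Recording one outcome per stage means a failed image export does not hide the fact that text extraction succeeded on the same file.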

Engineering review notes for text, image, and font extraction

Use these review notes to make sure the feature has moved beyond a demo and can be defended during release, support, and customer escalation.

Decision: whether output needs visual order, content-stream order, or search-oriented order.
Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals.
Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status.
Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.

Decision: image extraction format, color conversion, compression retention, and naming.
Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed.
Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content.
Regression trigger: PDF drawing order may not equal human reading order.

Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling.
Implementation pressure point: classify fonts by subset, embedded status, and encoding behavior.
Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence.
Regression trigger: ligatures and custom encodings can make copied text differ from visible text.

Decision: confidence flags for OCR layers, hidden text, clipped content, and rotated pages.
Implementation pressure point: produce an analysis report that distinguishes facts from inferred reading order.
Acceptance evidence: image size, color space, compression, mask, and export filename.
Regression trigger: images may be masks, soft masks, or repeated resources rather than standalone pictures.

Decision: whether output needs visual order, content-stream order, or search-oriented order.
Implementation pressure point: scan page resources and content streams while preserving object references.
Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status.
Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.

Decision: image extraction format, color conversion, compression retention, and naming.
Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals.
Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content.
Regression trigger: PDF drawing order may not equal human reading order.

Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling.
Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed.
Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence.
Regression trigger: ligatures and custom encodings can make copied text differ from visible text.

Edge cases

PDF drawing order may not equal human reading order
ligatures and custom encodings can make copied text differ from visible text
images may be masks, soft masks, or repeated resources rather than standalone pictures
OCR layers can contain stale or misaligned text over scanned pages
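
The image case can be made concrete with a small sketch. It assumes the extraction layer reports every image placement together with the PDF object number and a role (picture, stencil mask, or soft mask), then counts only distinct picture objects as files worth exporting. The types and functions are hypothetical, and generation numbers are left out for brevity.

```pascal
program ImageRoleSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TImageRole = (irPicture, irStencilMask, irSoftMask);

  // One use of an image object on a page.
  TImagePlacement = record
    ObjectNumber: Integer;  // PDF object number of the image
    PageNumber: Integer;
    Role: TImageRole;
  end;
  TImagePlacementArray = array of TImagePlacement;

function MakePlacement(ObjectNumber, PageNumber: Integer;
  Role: TImageRole): TImagePlacement;
begin
  Result.ObjectNumber := ObjectNumber;
  Result.PageNumber := PageNumber;
  Result.Role := Role;
end;

// True when an earlier placement already refers to the same object.
function SeenBefore(const Items: TImagePlacementArray; Index: Integer): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := 0 to Index - 1 do
    if Items[I].ObjectNumber = Items[Index].ObjectNumber then
    begin
      Result := True;
      Exit;
    end;
end;

// Counts distinct picture objects; masks and repeated uses are skipped.
function CountExportablePictures(const Items: TImagePlacementArray): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 0 to High(Items) do
    if (Items[I].Role = irPicture) and not SeenBefore(Items, I) then
      Inc(Result);
end;

// Sample data: a reused logo, a soft mask, and a second picture.
function DemoPlacements: TImagePlacementArray;
begin
  SetLength(Result, 4);
  Result[0] := MakePlacement(12, 1, irPicture);   // logo on page 1
  Result[1] := MakePlacement(12, 2, irPicture);   // same object reused on page 2
  Result[2] := MakePlacement(13, 2, irSoftMask);  // transparency data, not a picture
  Result[3] := MakePlacement(14, 2, irPicture);   // a second picture
end;

begin
  // Prints: 2
  Writeln(CountExportablePictures(DemoPlacements()));
end.
```

Keeping all four placements in the report while exporting only two files preserves the page context without producing duplicate or meaningless image files.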

Delphi / C++Builder notes

PDFlibPas should sit behind a small service boundary that receives files, streams, profiles, and credentials, then returns output paths, warnings, metrics, and validation status. Important terms include text extraction, image extraction, font resource, ToUnicode, content stream, and coordinates.

Delphi code example

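The following Delphi sketch shows a practical service boundary for this topic. Keeping policy checks, logging, and validation outside the narrow part that calls the product makes the workflow easier to test.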

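// Service-boundary sketch. ExtractDocumentText, ExtractDocumentImages,
// BuildFontInventory, and the Save* routines are application-level helpers
// that are not defined here.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // The empty second argument is assumed to be the document password.
    // This sketch does not check whether loading succeeded; production code
    // should confirm that before extracting anything.
    Pdf.LoadFromFile(FileName, '');
    // Each stage writes its own output so that a failure can be traced to it.
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    // Release the engine instance even if an extraction step raises.
    Pdf.Free;
  end;
end;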

Production checklist

Run the workflow on an empty file, a normal customer file, and a worst-case file
Open the generated PDF with the target viewer, validator, printer, or downstream application
Log product version, profile version, input hash, output path, elapsed time, and warning count
Keep passwords, certificates, temporary files, and customer data under explicit retention rules
Add regression documents when a customer file exposes a new edge case

Product documentation

PDFlibPas