PDFlibPas: Extracting Text, Images, and Fonts in Delphi

losLab PDF Library gives Delphi and C++Builder teams a source-available PDF engine for desktop, server, DLL, ActiveX, and Dylib workflows. It includes built-in PDF/A and PDF/UA checks, PAdES signatures, and a choice of renderer, all without sending documents to an external PDF service.

This article is for teams building PDF analysis, migration, search, evidence capture, or support-inspection tools. It treats text, image, and font extraction as production document engineering rather than a single component call.

The practical risk is that extraction output is easy to over-trust, even though PDF content order, font encoding, image color spaces, and page resources rarely match the user-visible reading order exactly. The workflow therefore needs a written contract, observable diagnostics, and realistic regression files.

Architectural decisions

Separate extraction facts from interpretation. Settle the following decisions before writing extraction code:

whether output needs visual order, content-stream order, or search-oriented order
image extraction format, color conversion, compression retention, and naming
font subset naming, encoding diagnostics, and missing ToUnicode handling
confidence flags for OCR layers, hidden text, clipped content, and rotated pages
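
One way to make these decisions explicit is to hold them in a single profile record that is passed to every extraction run and written to the log. The sketch below is only an illustration: every type, field, and value name in it is hypothetical and none of them comes from PDFlibPas.

```pascal
program ExtractionProfileSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical types: they only mirror the decisions listed above.
  TTextOrder = (toVisual, toContentStream, toSearch);
  TImageExport = (ieOriginalData, ieConverted);
  TMissingToUnicode = (mtFlagRun, mtSkipRun, mtFailPage);

  TExtractionProfile = record
    TextOrder: TTextOrder;               // which order the text output promises
    ImageExport: TImageExport;           // keep original object data or convert
    KeepCompression: Boolean;            // retain the stored image compression
    ImageNamePattern: string;            // naming rule for exported images
    MissingToUnicode: TMissingToUnicode; // what to do without a ToUnicode map
    FlagOcrLayer: Boolean;
    FlagHiddenText: Boolean;
    FlagClipped: Boolean;
    FlagRotated: Boolean;
  end;

const
  TextOrderNames: array[TTextOrder] of string =
    ('visual', 'content-stream', 'search');
  ImageExportNames: array[TImageExport] of string =
    ('original-data', 'converted');
  MissingToUnicodeNames: array[TMissingToUnicode] of string =
    ('flag-run', 'skip-run', 'fail-page');

// Example profile for a search indexer; the chosen values are assumptions.
function SearchIndexProfile: TExtractionProfile;
begin
  Result.TextOrder := toSearch;
  Result.ImageExport := ieOriginalData;
  Result.KeepCompression := True;
  Result.ImageNamePattern := 'page%d-img%d';
  Result.MissingToUnicode := mtFlagRun;
  Result.FlagOcrLayer := True;
  Result.FlagHiddenText := True;
  Result.FlagClipped := True;
  Result.FlagRotated := True;
end;

var
  P: TExtractionProfile;
begin
  P := SearchIndexProfile;
  // Write the profile next to the output so a reviewer can see the contract.
  Writeln('text order: ', TextOrderNames[P.TextOrder]);
  Writeln('image export: ', ImageExportNames[P.ImageExport]);
  Writeln('keep compression: ', P.KeepCompression);
  Writeln('image name pattern: ', P.ImageNamePattern);
  Writeln('missing ToUnicode: ', MissingToUnicodeNames[P.MissingToUnicode]);
end.
```

Keeping the profile as data rather than scattered flags means the same record can be versioned, logged, and attached to a support case.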

Implementation order

Preserve page and resource context. The order below keeps the workflow reviewable for Delphi and C++Builder teams:

scan page resources and content streams while preserving object references
extract text runs with coordinates, font identity, Unicode mapping, and style signals
extract images with page location, dimensions, color space, and original object data when needed
classify fonts by subset, embedded status, and encoding behavior
produce an analysis report that distinguishes facts from inferred reading order
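
The last step, keeping facts apart from inferred reading order, can be sketched as follows. The record layout, the tolerance value, and the top-to-bottom, left-to-right heuristic are all assumptions made for illustration; PDFlibPas is not called, and real pages with columns or rotation need a stronger grouping rule.

```pascal
program ReadingOrderSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical text run: StreamIndex and coordinates are extraction facts.
  TTextRun = record
    StreamIndex: Integer; // position in the content stream
    X, Y: Double;         // baseline origin, PDF default space (Y grows upward)
    Text: string;
  end;
  TTextRuns = array of TTextRun;

const
  LineTolerance = 2.0; // assumed: runs within 2 units of Y share a line

function MakeRun(AIndex: Integer; AX, AY: Double;
  const AText: string): TTextRun;
begin
  Result.StreamIndex := AIndex;
  Result.X := AX;
  Result.Y := AY;
  Result.Text := AText;
end;

// Heuristic comparison: higher lines first, then left to right on a line.
function ComesBefore(const A, B: TTextRun): Boolean;
begin
  if Abs(A.Y - B.Y) > LineTolerance then
    Result := A.Y > B.Y
  else if A.X <> B.X then
    Result := A.X < B.X
  else
    Result := A.StreamIndex < B.StreamIndex;
end;

// Returns a sorted copy; the input array (the facts) is left untouched.
function InferReadingOrder(const Runs: TTextRuns): TTextRuns;
var
  I, J: Integer;
  Key: TTextRun;
begin
  Result := Copy(Runs, 0, Length(Runs));
  for I := 1 to High(Result) do
  begin
    Key := Result[I];
    J := I - 1;
    while (J >= 0) and ComesBefore(Key, Result[J]) do
    begin
      Result[J + 1] := Result[J];
      Dec(J);
    end;
    Result[J + 1] := Key;
  end;
end;

var
  Facts, Inferred: TTextRuns;
  I: Integer;
begin
  // Sample runs in content-stream order, which differs from visual order.
  SetLength(Facts, 4);
  Facts[0] := MakeRun(0, 100, 700, 'World');
  Facts[1] := MakeRun(1, 50, 700, 'Hello');
  Facts[2] := MakeRun(2, 50, 50, 'Footer');
  Facts[3] := MakeRun(3, 50, 750, 'Title');

  Inferred := InferReadingOrder(Facts);
  for I := 0 to High(Inferred) do
    Writeln(Inferred[I].Text, ' (stream index ', Inferred[I].StreamIndex, ')');
end.
```

With the sample data the inferred order is Title, Hello, World, Footer, while each line still reports the original stream index, so a reviewer can see both the fact and the inference.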

Verification evidence

Extraction evidence should remain explainable. Keep these fields with the output or support record:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
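
For the font subset name field, the PDF specification's convention is that a subset font name starts with a tag of exactly six uppercase letters followed by a plus sign, for example ABCDEF+Helvetica. The sketch below checks a name for that tag and splits off the base name. The function names are hypothetical, and the check only inspects the name; it does not replace the embedded and subset flags reported by the library.

```pascal
program SubsetTagSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// True when the name starts with six uppercase letters and a plus sign.
function HasSubsetTag(const FontName: string): Boolean;
var
  I: Integer;
begin
  Result := False;
  // Six tag letters, the plus sign, and at least one character of base name.
  if (Length(FontName) < 8) or (FontName[7] <> '+') then
    Exit;
  for I := 1 to 6 do
    if (FontName[I] < 'A') or (FontName[I] > 'Z') then
      Exit;
  Result := True;
end;

// Returns the name without the subset tag, or the name unchanged.
function BaseFontName(const FontName: string): string;
begin
  if HasSubsetTag(FontName) then
    Result := Copy(FontName, 8, Length(FontName))
  else
    Result := FontName;
end;

procedure Report(const FontName: string);
begin
  Writeln(FontName, ' -> subset tag: ', HasSubsetTag(FontName),
    ', base name: ', BaseFontName(FontName));
end;

begin
  Report('ABCDEF+Helvetica'); // tagged subset name
  Report('Helvetica');        // no tag
  Report('ABCdEF+Arial');     // lowercase letter, so not a valid tag
end.
```

Recording both the raw name and the derived base name keeps the evidence explainable when two subsets of the same font appear in one document.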

Extracted text is not always authored text

A professional extraction workflow should record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.

Designing the support package

After PDFlibPas is deployed, the most useful support package is one that explains the input, the profile, the output, and the exact stage at which a failure occurred. It should carry:

page number, object reference, coordinates, decoded text, font, and confidence
image size, color space, compression, mask, and export filename
font subset name, embedded state, encoding map, and ToUnicode status
warnings for hidden, clipped, rotated, or overlapping content
terminology snapshot: text extraction, image extraction, font resource, ToUnicode

Engineering review notes for text, image, and font extraction

Use these notes to confirm that the feature has moved beyond a demo and can be defended at release, in support, and during a customer escalation.

Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation touchpoint: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation touchpoint: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation touchpoint: classify fonts by subset, embedded status, and encoding behavior. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.
Decision: confidence flags for OCR layers, hidden text, clipped content, and rotated pages. Implementation touchpoint: produce an analysis report that distinguishes facts from inferred reading order. Acceptance evidence: image size, color space, compression, mask, and export filename. Regression trigger: images may be masks, soft masks, or repeated resources rather than standalone pictures.
Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation touchpoint: scan page resources and content streams while preserving object references. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
Decision: image extraction format, color conversion, compression retention, and naming. Implementation touchpoint: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation touchpoint: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.

Edge cases

PDF drawing order may not equal human reading order
ligatures and custom encodings can make copied text differ from visible text
images may be masks, soft masks, or repeated resources rather than standalone pictures
OCR layers can contain stale or misaligned text over scanned pages
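
These edge cases are easier to act on when each extracted item carries a set of flags and a derived confidence level. The following is a minimal sketch under stated assumptions: the flag names, the three confidence levels, and the rule that hidden text, OCR layers, and a missing ToUnicode map lower confidence the most are illustrative policy choices, not behavior of PDFlibPas.

```pascal
program ConfidenceFlagsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical flags an extractor could attach to a text run or image.
  TItemFlag = (ifHidden, ifClipped, ifRotated, ifOverlapping,
    ifOcrLayer, ifNoToUnicode, ifImageMask);
  TItemFlags = set of TItemFlag;
  TConfidence = (cfHigh, cfReview, cfLow);

const
  ConfidenceNames: array[TConfidence] of string = ('high', 'review', 'low');
  // Assumed policy: these flags mean the text may not match what is visible.
  SevereFlags: TItemFlags = [ifHidden, ifOcrLayer, ifNoToUnicode];

// Maps a flag set to a confidence level using the assumed policy above.
function ConfidenceFor(const Flags: TItemFlags): TConfidence;
begin
  if Flags = [] then
    Result := cfHigh
  else if Flags * SevereFlags <> [] then
    Result := cfLow
  else
    Result := cfReview;
end;

begin
  Writeln('no flags: ', ConfidenceNames[ConfidenceFor([])]);
  Writeln('rotated page: ', ConfidenceNames[ConfidenceFor([ifRotated])]);
  Writeln('OCR layer, clipped: ',
    ConfidenceNames[ConfidenceFor([ifOcrLayer, ifClipped])]);
end.
```

Storing the raw flags alongside the derived level lets the policy change later without re-extracting the documents.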

Delphi / C++Builder notes

PDFlibPas should sit behind a small service boundary that receives files, streams, profiles, and credentials, then returns output paths, warnings, metrics, and validation status. Key terms include text extraction, image extraction, font resource, ToUnicode, content stream, and coordinates.

Delphi code example

The following Delphi sketch shows a practical service boundary for this topic. Keep policy checks, logging, and validation outside the narrow block that calls the product, so the workflow is easier to test.

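// ExtractDocumentText, ExtractDocumentImages, BuildFontInventory, and the
// Save* routines are application-level helpers, not PDFlibPas calls.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // Assumption: LoadFromFile returns 1 on success, following the same
    // convention as SelectFont; check the PDFlibPas reference.
    // Exception is declared in SysUtils.
    if Pdf.LoadFromFile(FileName, '') <> 1 then
      raise Exception.CreateFmt('Cannot open PDF: %s', [FileName]);
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    // Release the document even when an extraction step raises.
    Pdf.Free;
  end;
end;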

Production checklist

Run the workflow on an empty file, a typical customer file, and a worst-case file
Open the generated PDF in the target viewer, validator, printer, or downstream application
Record the product version, profile version, input hash, output path, elapsed time, and warning count
Store passwords, certificates, temporary files, and customer data under explicit retention rules
Add regression documents whenever a customer file exposes a new edge case
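
The fields named in the checklist can be kept together as one run record and written as a single log line. This is a sketch only: the record, the key=value layout, and the placeholder values are assumptions, and a real service would fill them from the library version, the profile, and a file hash.

```pascal
program RunRecordSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical record holding the checklist fields for one extraction run.
  TRunRecord = record
    ProductVersion: string;
    ProfileVersion: string;
    InputHash: string;
    OutputPath: string;
    ElapsedMs: Integer;
    WarningCount: Integer;
  end;

// Writes the record as one key=value line that is easy to search in a log.
procedure WriteRunRecord(const R: TRunRecord);
begin
  Writeln('product=', R.ProductVersion,
    ' profile=', R.ProfileVersion,
    ' input_hash=', R.InputHash,
    ' output=', R.OutputPath,
    ' elapsed_ms=', R.ElapsedMs,
    ' warnings=', R.WarningCount);
end;

var
  R: TRunRecord;
begin
  // Placeholder values only; they do not describe a real run.
  R.ProductVersion := '<product version>';
  R.ProfileVersion := '<profile version>';
  R.InputHash := '<input hash>';
  R.OutputPath := '<output path>';
  R.ElapsedMs := 0;
  R.WarningCount := 0;
  WriteRunRecord(R);
end.
```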

Product documentation

PDFlibPas

Additional code examples

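// Lists the images on page 1 and saves each image's data to a file.
var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    // Assumption: image list indexes start at 1, like font indexes;
    // confirm against the PDFlibPas reference before relying on it.
    for I := 1 to Pdf.GetImageListCount(ImgList) do
    begin
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    // Release the list even if saving an image fails.
    Pdf.ReleaseImageList(ImgList);
  end;
end;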

var
  I: Integer;
begin
  Pdf.FindFonts;
  for I := 1 to Pdf.FontCount do        // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      Writeln(Format('%s  type=%d  embedded=%d  subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;