losLab PDF Library gives Delphi and C++Builder teams a source-available PDF engine for desktop, server, DLL, ActiveX, and Dylib targets, with built-in PDF/A and PDF/UA checks, PAdES signing, and selectable renderers, without an external PDF service.
This article is written for teams building PDF analysis, migration, search, evidence capture, or support-inspection tools. It treats text, image, and font extraction as production-grade document engineering, not as an isolated component call.
The practical risk is that extraction output is easy to over-trust, even though PDF content order, font encoding, image color spaces, and page resources rarely match the user-visible reading order exactly. The workflow therefore needs a written contract, observable diagnostics, and realistic regression files.
Architecture decisions
Separate extraction facts from interpretation. Settle the following decisions before implementation starts:
- whether output needs visual order, content-stream order, or search-oriented order
- image extraction format, color conversion, compression retention, and naming
- font subset naming, encoding diagnostics, and missing ToUnicode handling
- confidence flags for OCR layers, hidden text, clipped content, and rotated pages
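One way to make these decisions reviewable is to write them down as a profile record that every extraction run receives and logs. The sketch below is only an illustration: every type, field, and value in it (TExtractionProfile, TTextOrder, the 'indexing-1' version string, the file-name pattern) is a hypothetical application-side name, not part of the PDF library.

```pascal
program ExtractionProfileSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical application-side types; none of them come from the PDF library.
  TTextOrder = (toContentStream, toVisual, toSearch);
  TImageExport = (ieOriginalStream, ieConvertedRgb);
  TMissingToUnicode = (mtFlagRun, mtSkipRun, mtFailDocument);

  TExtractionProfile = record
    Version: string;                     // logged with every run
    TextOrder: TTextOrder;               // ordering the consumer expects
    ImageExport: TImageExport;           // keep original compression or convert
    ImageNamePattern: string;            // Format pattern: page number, image index
    MissingToUnicode: TMissingToUnicode; // what to do when a font has no ToUnicode map
    FlagOcrLayers: Boolean;
    FlagHiddenText: Boolean;
    FlagClippedContent: Boolean;
    FlagRotatedPages: Boolean;
  end;

const
  TextOrderNames: array[TTextOrder] of string =
    ('content-stream', 'visual', 'search');

// Builds one example profile for a search-indexing consumer.
function DefaultIndexingProfile: TExtractionProfile;
begin
  Result.Version := 'indexing-1';
  Result.TextOrder := toSearch;
  Result.ImageExport := ieOriginalStream;
  Result.ImageNamePattern := 'page%.3d-img%.2d';
  Result.MissingToUnicode := mtFlagRun;
  Result.FlagOcrLayers := True;
  Result.FlagHiddenText := True;
  Result.FlagClippedContent := True;
  Result.FlagRotatedPages := True;
end;

// Derives an export file stem from the profile's naming decision.
function ImageFileStem(const P: TExtractionProfile; Page, Index: Integer): string;
begin
  Result := Format(P.ImageNamePattern, [Page, Index]);
end;

var
  Profile: TExtractionProfile;
begin
  Profile := DefaultIndexingProfile;
  Writeln('profile=', Profile.Version,
    ' order=', TextOrderNames[Profile.TextOrder]);
  Writeln(ImageFileStem(Profile, 7, 3));
end.
```

Keeping the profile as plain data means it can be versioned, diffed, and attached to a support package without touching the extraction code.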
Implementation flow
Preserve page and resource context. The order below keeps the workflow reviewable for Delphi and C++Builder teams.
- scan page resources and content streams while preserving object references
- extract text runs with coordinates, font identity, Unicode mapping, and style signals
- extract images with page location, dimensions, color space, and original object data when needed
- classify fonts by subset, embedded status, and encoding behavior
- produce an analysis report that distinguishes facts from inferred reading order
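The last step, keeping facts apart from inferred reading order, can be sketched without any library calls. The example below assumes text runs have already been extracted with coordinates; TTextRun, its fields, and the 2-point line tolerance are hypothetical choices for illustration. Content-stream position and coordinates are stored as facts, while the visual line number is computed separately and can be discarded or recomputed.

```pascal
program ReadingOrderSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record; the fields are assumptions, not library types.
  TTextRun = record
    StreamIndex: Integer; // fact: position in content-stream order
    X, Y: Double;         // fact: run origin in PDF user space (Y grows upward)
    Decoded: string;      // fact: decoded text
    Line: Integer;        // inferred: visual line number
  end;
  TTextRuns = array of TTextRun;

function MakeRun(StreamIndex: Integer; X, Y: Double; const AText: string): TTextRun;
begin
  Result.StreamIndex := StreamIndex;
  Result.X := X;
  Result.Y := Y;
  Result.Decoded := AText;
  Result.Line := -1; // not inferred yet
end;

// Ordering rule: top-to-bottom by Y, or by (Line, X) once lines are assigned.
function Before(const A, B: TTextRun; ByLine: Boolean): Boolean;
begin
  if ByLine then
    Result := (A.Line < B.Line) or ((A.Line = B.Line) and (A.X < B.X))
  else
    Result := A.Y > B.Y;
end;

// Stable insertion sort; adequate for the run count of a single page.
procedure SortRuns(var Runs: TTextRuns; ByLine: Boolean);
var
  I, J: Integer;
  Key: TTextRun;
begin
  for I := 1 to High(Runs) do
  begin
    Key := Runs[I];
    J := I - 1;
    while (J >= 0) and Before(Key, Runs[J], ByLine) do
    begin
      Runs[J + 1] := Runs[J];
      Dec(J);
    end;
    Runs[J + 1] := Key;
  end;
end;

// Groups runs into lines when their Y values differ by at most Tolerance,
// then orders each line left to right. This is an inference, not a fact.
procedure InferVisualOrder(var Runs: TTextRuns; Tolerance: Double);
var
  I: Integer;
begin
  SortRuns(Runs, False);
  for I := 0 to High(Runs) do
    if I = 0 then
      Runs[I].Line := 0
    else if Runs[I - 1].Y - Runs[I].Y > Tolerance then
      Runs[I].Line := Runs[I - 1].Line + 1
    else
      Runs[I].Line := Runs[I - 1].Line;
  SortRuns(Runs, True);
end;

var
  Runs: TTextRuns;
  I: Integer;
begin
  SetLength(Runs, 4);
  Runs[0] := MakeRun(0, 120.0, 700.0, 'World');
  Runs[1] := MakeRun(1, 72.0, 700.5, 'Hello');
  Runs[2] := MakeRun(2, 72.0, 680.0, 'Second');
  Runs[3] := MakeRun(3, 130.0, 680.0, 'line');
  InferVisualOrder(Runs, 2.0);
  for I := 0 to High(Runs) do
    Writeln(Format('line=%d stream=%d text=%s',
      [Runs[I].Line, Runs[I].StreamIndex, Runs[I].Decoded]));
end.
```

With the four sample runs, the content-stream order World, Hello, Second, line is reported in the visual order Hello, World, Second, line, and each row still carries its original stream index. A single tolerance value will not handle multi-column layouts or rotated pages; those cases need their own rules and their own warnings.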
Validation evidence
Extraction evidence should remain explainable. Keep these fields together with the output or the support material.
- page number, object reference, coordinates, decoded text, font, and confidence
- image size, color space, compression, mask, and export filename
- font subset name, embedded state, encoding map, and ToUnicode status
- warnings for hidden, clipped, rotated, or overlapping content
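As a minimal sketch of how the text-run fields above could travel together, the record below is flattened into one tab-separated line per run. The record layout, the warning set, and especially the confidence rule are assumptions made for illustration; a real rule would come from the written extraction contract.

```pascal
program EvidenceRecordSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical evidence types; none of them come from the PDF library.
  TWarning = (wHidden, wClipped, wRotated, wOverlapping);
  TWarnings = set of TWarning;

  TTextEvidence = record
    Page: Integer;
    ObjectRef: string;     // for example '12 0 R'; how it is obtained is out of scope
    X, Y: Double;
    DecodedText: string;
    FontName: string;
    HasToUnicode: Boolean;
    Warnings: TWarnings;
  end;

const
  WarningNames: array[TWarning] of string =
    ('hidden', 'clipped', 'rotated', 'overlapping');

// Renders the warning set as a comma-separated list.
function WarningsToText(const W: TWarnings): string;
var
  Item: TWarning;
begin
  Result := '';
  for Item := Low(TWarning) to High(TWarning) do
    if Item in W then
    begin
      if Result <> '' then
        Result := Result + ',';
      Result := Result + WarningNames[Item];
    end;
end;

// Illustrative rule only: missing ToUnicode or hidden text lowers confidence most.
function Confidence(const E: TTextEvidence): string;
begin
  if (not E.HasToUnicode) or (wHidden in E.Warnings) then
    Result := 'low'
  else if E.Warnings <> [] then
    Result := 'medium'
  else
    Result := 'high';
end;

// One tab-separated line per run. Tabs inside the text are replaced so the
// column count stays fixed; the decimal separator follows the current locale.
function EvidenceLine(const E: TTextEvidence): string;
begin
  Result := Format('%d'#9'%s'#9'%.2f'#9'%.2f'#9'%s'#9'%s'#9'%s'#9'%s'#9'%s',
    [E.Page, E.ObjectRef, E.X, E.Y,
     StringReplace(E.DecodedText, #9, ' ', [rfReplaceAll]),
     E.FontName, BoolToStr(E.HasToUnicode, True),
     WarningsToText(E.Warnings), Confidence(E)]);
end;

var
  Sample: TTextEvidence;
begin
  Sample.Page := 3;
  Sample.ObjectRef := '12 0 R';
  Sample.X := 72.0;
  Sample.Y := 700.5;
  Sample.DecodedText := 'Invoice'#9'total';
  Sample.FontName := 'ABCDEF+Helvetica';
  Sample.HasToUnicode := True;
  Sample.Warnings := [wRotated];
  Writeln(EvidenceLine(Sample));
end.
```

A flat line format is easy to diff between product versions, which helps when a regression file starts producing different output.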
Extracted text is not always authored text
A professional extraction workflow should record where each text run, image, and font resource came from, how it was decoded, and which assumptions were used to group it into searchable or reviewable content.
Support package design
Once PDFlibPas is deployed, the most valuable support package is the one that explains the input, profile, output, and exact stage that failed.
- the validation evidence fields listed earlier for text runs, images, fonts, and warnings
- terminology snapshot: text extraction, image extraction, font resource, ToUnicode
Technical review notes for text, image, and font extraction
Use these review notes to confirm that the feature has moved past demo level and can be defended during delivery, support, and customer escalation.
- Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
- Decision: image extraction format, color conversion, compression retention, and naming. Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
- Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation pressure point: classify fonts by subset, embedded status, and encoding behavior. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.
- Decision: confidence flags for OCR layers, hidden text, clipped content, and rotated pages. Implementation pressure point: produce an analysis report that distinguishes facts from inferred reading order. Acceptance evidence: image size, color space, compression, mask, and export filename. Regression trigger: images may be masks, soft masks, or repeated resources rather than standalone pictures.
- Decision: whether output needs visual order, content-stream order, or search-oriented order. Implementation pressure point: scan page resources and content streams while preserving object references. Acceptance evidence: font subset name, embedded state, encoding map, and ToUnicode status. Regression trigger: OCR layers can contain stale or misaligned text over scanned pages.
- Decision: image extraction format, color conversion, compression retention, and naming. Implementation pressure point: extract text runs with coordinates, font identity, Unicode mapping, and style signals. Acceptance evidence: warnings for hidden, clipped, rotated, or overlapping content. Regression trigger: PDF drawing order may not equal human reading order.
- Decision: font subset naming, encoding diagnostics, and missing ToUnicode handling. Implementation pressure point: extract images with page location, dimensions, color space, and original object data when needed. Acceptance evidence: page number, object reference, coordinates, decoded text, font, and confidence. Regression trigger: ligatures and custom encodings can make copied text differ from visible text.
Edge cases
- PDF drawing order may not equal human reading order
- ligatures and custom encodings can make copied text differ from visible text
- images may be masks, soft masks, or repeated resources rather than standalone pictures
- OCR layers can contain stale or misaligned text over scanned pages
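The ligature case is a good example of keeping the decoded fact and the search-oriented interpretation side by side. The sketch below expands the Unicode Latin ligatures U+FB00 to U+FB04 for a search index and reports whether anything changed, so the original decoded text can still be stored untouched. ExpandLigatures and LigatureExpansion are hypothetical helper names; a production workflow would more likely apply full Unicode compatibility normalization, which this sketch does not attempt.

```pascal
program LigatureSketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

// Returns the letter sequence for a Latin ligature code point,
// or an empty string when the character is not one of the handled ligatures.
function LigatureExpansion(C: WideChar): UnicodeString;
begin
  case Ord(C) of
    $FB00: Result := 'ff';
    $FB01: Result := 'fi';
    $FB02: Result := 'fl';
    $FB03: Result := 'ffi';
    $FB04: Result := 'ffl';
  else
    Result := '';
  end;
end;

// Produces a search-oriented copy of S. Changed tells the caller that the
// searchable text differs from the decoded text and should be flagged.
function ExpandLigatures(const S: UnicodeString; out Changed: Boolean): UnicodeString;
var
  I: Integer;
  Expansion: UnicodeString;
begin
  Result := '';
  Changed := False;
  for I := 1 to Length(S) do
  begin
    Expansion := LigatureExpansion(S[I]);
    if Expansion <> '' then
    begin
      Result := Result + Expansion;
      Changed := True;
    end
    else
      Result := Result + S[I];
  end;
end;

var
  Raw, Searchable: UnicodeString;
  Changed: Boolean;
begin
  // "o" + U+FB03 (ffi ligature) + "ce": four code units as decoded from the page
  Raw := UnicodeString('o') + WideChar($FB03) + UnicodeString('ce');
  Searchable := ExpandLigatures(Raw, Changed);
  Writeln(Searchable);                               // office
  Writeln(Length(Raw), ' -> ', Length(Searchable));  // 4 -> 6
  Writeln(Changed);                                  // TRUE
end.
```

Storing both strings, plus the Changed flag, lets a reviewer see why a search hit for "office" points at a run whose copied text looks different.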
Delphi / C++Builder notes
PDFlibPas should sit behind a small service boundary that receives files, streams, profiles, and credentials, then returns output paths, warnings, metrics, and validation status. Important terms include text extraction, image extraction, font resource, ToUnicode, content stream, coordinates.
Delphi code example
The following Delphi sketch shows a practical service boundary for this topic. Keep policy checks, logging, and validation outside the narrow product call so the workflow stays testable.
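// ExtractDocumentText, ExtractDocumentImages, BuildFontInventory, and the
// Save* routines are application-defined helpers, not library calls.
procedure ExtractForIndexing(const FileName, OutputDir: string);
var
  Pdf: TPDFlib;
begin
  Pdf := TPDFlib.Create;
  try
    // The second argument is the password. LoadFromFile is expected to
    // return 1 on success; confirm this against the library reference.
    if Pdf.LoadFromFile(FileName, '') <> 1 then
      raise Exception.CreateFmt('Cannot open %s', [FileName]);
    SaveExtractedText(OutputDir, ExtractDocumentText(Pdf));
    SaveEmbeddedImages(OutputDir, ExtractDocumentImages(Pdf));
    SaveFontInventory(OutputDir, BuildFontInventory(Pdf));
  finally
    Pdf.Free;
  end;
end;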
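The procedure above reports nothing back to the caller. A hedged sketch of the request and result envelope described in the notes could look like the following. All names here (TExtractionRequest, TExtractionResult, TExtractionEngine, RunExtraction) are hypothetical, and the engine is a stub callback standing in for the code that would call the PDF library.

```pascal
program ServiceBoundarySketch;

{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, DateUtils;

type
  TExtractionStatus = (esOk, esOkWithWarnings, esFailed);

  TExtractionRequest = record
    InputFile: string;
    OutputDir: string;
    ProfileName: string;
    Password: string;    // credentials stay in the request and are never logged
  end;

  TExtractionResult = record
    Status: TExtractionStatus;
    FailedStage: string; // empty when nothing failed
    WarningCount: Integer;
    ElapsedMs: Int64;
  end;

  // Stand-in for the routine that would call the PDF library.
  TExtractionEngine = procedure(const Request: TExtractionRequest;
    Warnings: TStrings);

// Wraps the engine call so callers always get a status, a failed stage,
// a warning count, and a timing, even when the engine raises.
function RunExtraction(const Request: TExtractionRequest;
  Engine: TExtractionEngine; Warnings: TStrings): TExtractionResult;
var
  Started: TDateTime;
begin
  Result.Status := esFailed;
  Result.FailedStage := '';
  Started := Now;
  try
    if not FileExists(Request.InputFile) then
      Result.FailedStage := 'input'
    else
    begin
      Engine(Request, Warnings);
      if Warnings.Count > 0 then
        Result.Status := esOkWithWarnings
      else
        Result.Status := esOk;
    end;
  except
    on E: Exception do
    begin
      Result.FailedStage := 'engine';
      Warnings.Add(E.ClassName + ': ' + E.Message);
    end;
  end;
  Result.WarningCount := Warnings.Count;
  Result.ElapsedMs := MilliSecondsBetween(Now, Started);
end;

// Stub engine used only to exercise the boundary.
procedure FakeEngine(const Request: TExtractionRequest; Warnings: TStrings);
begin
  Warnings.Add('page 3: font F2 has no ToUnicode map');
end;

var
  Request: TExtractionRequest;
  Outcome: TExtractionResult;
  Warnings: TStringList;
begin
  Request.InputFile := ParamStr(0); // the running executable, so the file exists
  Request.OutputDir := 'out';
  Request.ProfileName := 'indexing-1';
  Request.Password := '';
  Warnings := TStringList.Create;
  try
    Outcome := RunExtraction(Request, FakeEngine, Warnings);
    Writeln('status=', Ord(Outcome.Status),
      ' warnings=', Outcome.WarningCount,
      ' stage="', Outcome.FailedStage, '"');
  finally
    Warnings.Free;
  end;
end.
```

Passing the engine as a parameter keeps the boundary testable without a real PDF: a test can inject a stub that raises or emits warnings and then check the returned status.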
Production checklist
- Run the workflow on an empty file, a typical customer file, and a worst-case file
- Open the generated PDF with the appropriate viewer, validator, printer, or downstream application
- Log product version, profile version, input hash, output path, elapsed time, and warning count
- Keep passwords, certificates, temporary files, and customer data under clear retention rules
- Add a regression document whenever a customer file reveals a new edge case
Product documentation
More code examples
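// Export the raw image data found on page 1.
// The procedure names are illustrative wrappers around the library calls.
procedure DumpPageImages(Pdf: TPDFlib);
var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    // Image-list indexes are assumed to start at 1, like the font indexes
    // below; confirm this against the reference for your library version.
    for I := 1 to Pdf.GetImageListCount(ImgList) do
    begin
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      // The data is written as stored, so the name keeps a neutral '.bin'.
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    Pdf.ReleaseImageList(ImgList);
  end;
end;

// List every font with its type, embedded state, and subset state.
procedure DumpFontInventory(Pdf: TPDFlib);
var
  I: Integer;
begin
  Pdf.FindFonts;
  for I := 1 to Pdf.FontCount do // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      Writeln(Format('%s type=%d embedded=%d subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;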