Feed forty thousand customer PDFs into a search indexer and the failures sort themselves into three piles: documents whose words glue together because spacing thresholds were ignored, documents whose extracted text is gibberish because a subsetted font ships no ToUnicode map, and documents where "the logo" turns out to be nine separate image XObjects plus a soft mask. Extraction is not one API call — it is a chain of decisions about ordering, encoding, and provenance, and each decision changes what downstream systems can trust. losLab PDF Library (PDFlibPas) gives Delphi and C++Builder applications several extraction levels with different fidelity contracts; choosing among them deliberately is most of the work.
Text extraction levels and what each one promises
GetPageText takes an options value from 0 through 8, and the number selects an engine, not a format. Options 0 through 2 run a lightweight pass adequate for quick previews. Options 3 through 8 route through the layout-aware extraction engine, which reconstructs lines and spacing from glyph geometry: 4 and 6 additionally split output into words, 5 and 6 emit width information, and 7 produces plain text while deliberately ignoring font, color, and block metadata — the usual choice for feeding a search index.
Whatever the level, fidelity is bounded by the document itself. PDF maps character codes to glyphs; the reliable route back to text is a font's ToUnicode CMap (ISO 32000-1 §9.10), with standard encodings and glyph names as weaker fallbacks. A subsetted font with a custom encoding and no ToUnicode map leaves every extractor (this one, a viewer's copy-paste, anything) guessing from glyph names or giving up. Detecting that case matters more than handling it: flag the page as low-confidence and route it to OCR rather than silently indexing garbage.
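One way to flag a low-confidence page is to measure how much of the extracted text looks like a decoding failure. The sketch below is a heuristic, not part of the library: the function names and the 0.30 cutoff are hypothetical, and it assumes a Unicode-enabled Delphi. It treats U+FFFD, control codes, and Private Use Area code points as suspect, on the assumption that unmapped glyphs tend to surface as one of those.

```pascal
program ConfidenceSketch;
{$APPTYPE CONSOLE}

// Fraction of non-whitespace characters that look like decoding failures:
// U+FFFD, C0 control codes, or Private Use Area code points (U+E000..U+F8FF).
function SuspectRatio(const S: UnicodeString): Double;
var
  I, Code, Suspect, Counted: Integer;
begin
  Suspect := 0;
  Counted := 0;
  for I := 1 to Length(S) do
  begin
    Code := Ord(S[I]);
    // Whitespace carries no signal about mapping quality, so skip it.
    if (Code = 9) or (Code = 10) or (Code = 13) or (Code = 32) then
      Continue;
    Inc(Counted);
    if (Code < 32) or (Code = $FFFD) or
       ((Code >= $E000) and (Code <= $F8FF)) then
      Inc(Suspect);
  end;
  if Counted = 0 then
    Result := 0.0 // empty text is a separate signal; see the note below
  else
    Result := Suspect / Counted;
end;

// Hypothetical cutoff; tune it against pages already triaged by hand.
function IsLowConfidence(const S: UnicodeString): Boolean;
begin
  Result := SuspectRatio(S) > 0.30;
end;

var
  Clean, Broken: UnicodeString;
begin
  Clean := 'Payment is due within 30 days.';
  Broken := UnicodeString('ab') + WideChar($FFFD) + WideChar($E001) + WideChar($E002);
  Writeln('clean page low-confidence:  ', IsLowConfidence(Clean));
  Writeln('broken page low-confidence: ', IsLowConfidence(Broken));
end.
```

A page that yields no text at all scores zero here, so an empty result should be checked separately; it usually points to a scanned page rather than a font problem.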
When the flat options do not fit — custom tokenization, content-stream forensics, building a bespoke text funnel — the class layer exposes the decoder directly: TPDFExtractor is constructed over a page's resources dictionary and font collection, its ExtractTextW method turns raw content-stream text operations into Unicode through the same font machinery, and its OnFindObject event surfaces each object as it streams past. Most applications never need to drop this far; the ones that do are glad the layer is public.
Positioned blocks: the unit of search hits and redaction review
Plain text answers what the page says; most products eventually need to know where it says it — to highlight a search hit, to verify a redaction candidate, to anchor an annotation. ExtractPageTextBlocks returns a handle to a list of text runs, each carrying its text, bounding box, font name, and size:
// Requires SysUtils (Exception, Format) plus the library unit in the uses clause.
procedure DumpTextBlocks;
var
  Pdf: TPDFlib;
  Blocks, I: Integer;
begin
  Pdf := TPDFlib.Create;
  try
    if Pdf.LoadFromFile('contract.pdf', '') <> 1 then
      raise Exception.Create('load failed');
    Pdf.SelectPage(1);
    Blocks := Pdf.ExtractPageTextBlocks(0);
    try
      // One line per run: text, font name, size, and the first two bound values.
      for I := 0 to Pdf.GetTextBlockCount(Blocks) - 1 do
        Writeln(Format('%s [%s %.1f pt at %.0f,%.0f]',
          [Pdf.GetTextBlockText(Blocks, I),
           Pdf.GetTextBlockFontName(Blocks, I),
           Pdf.GetTextBlockFontSize(Blocks, I),
           Pdf.GetTextBlockBound(Blocks, I, 0),
           Pdf.GetTextBlockBound(Blocks, I, 1)]));
    finally
      Pdf.ReleaseTextBlocks(Blocks); // release the list even if formatting raises
    end;
  finally
    Pdf.Free;
  end;
end;
The state trap in this area catches teams during integration: SetTextExtractionArea, SetTextExtractionWordGap, and SetTextExtractionOptions are document-level persistent settings, not per-call arguments. An area restriction configured for one feature — reading just the header band for classification, say — quietly truncates every later extraction against the same document, including the layout-aware GetPageText levels. Reset extraction state between logical tasks, or scope each task to its own document handle.
The word-gap threshold is the lever for the glued-words pile from the opening triage: SetTextExtractionWordGap tells the layout engine how much horizontal space counts as a word break, measured against the page's glyph spacing. Tight tabular layouts want a smaller gap than airy marketing pages, and a per-document-class setting beats a global constant — just remember it persists like the rest of the extraction state.
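Because these settings persist, it can help to wrap each logical task so the restore step runs even when the task fails. The sketch below shows only the try/finally shape, using a Boolean as a stand-in for the document's extraction state. RunScoped is a hypothetical helper, and the real apply and restore callbacks would call the library's setters with whatever reset values its API reference specifies.

```pascal
program ScopedStateSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Runs Task between Apply and Restore; Restore runs even if Task raises.
procedure RunScoped(const Apply, Task, Restore: TProc);
begin
  Apply();
  try
    Task();
  finally
    Restore();
  end;
end;

var
  // Stand-in for the document's persistent extraction state.
  AreaRestricted: Boolean;
begin
  AreaRestricted := False;
  try
    RunScoped(
      procedure
      begin
        AreaRestricted := True; // e.g. restrict extraction to the header band
      end,
      procedure
      begin
        raise Exception.Create('classification failed');
      end,
      procedure
      begin
        AreaRestricted := False; // e.g. restore full-page extraction
      end);
  except
    on E: Exception do
      Writeln('task failed: ', E.Message);
  end;
  Writeln('area still restricted: ', AreaRestricted);
end.
```

The failing task still leaves the flag cleared, which is the property a later full-page extraction depends on.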
Images: original streams, not screenshots
Rendering the page and cropping is the wrong way to get images out of a PDF: it resamples, bakes in rotation, and discards the originals. GetPageImageList enumerates the actual image resources the page references, and each item exposes its properties and original data:
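// Requires SysUtils (Format); Pdf is an already loaded document.
procedure SavePageImages(Pdf: TPDFlib);
var
  ImgList, I: Integer;
begin
  Pdf.SelectPage(1);
  ImgList := Pdf.GetPageImageList(0);
  try
    for I := 0 to Pdf.GetImageListCount(ImgList) - 1 do
    begin
      // Log the format description, then write the item's original data.
      Writeln(Pdf.GetImageListItemFormatDesc(ImgList, I, 0));
      Pdf.SaveImageListItemDataToFile(ImgList, I, 0,
        Format('page1-img%.2d.bin', [I]));
    end;
  finally
    Pdf.ReleaseImageList(ImgList); // release the list even if a save raises
  end;
end;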
Inspect GetImageListItemFormatDesc before assuming anything about an item. What pages reference is rarely one clean picture per visible image: soft masks arrive as separate entries, the same XObject repeats across pages — deduplicate by content hash before archiving an "all images" export — and CMYK JPEGs need color management downstream or they render inverted in naive viewers. For a document-wide inventory rather than a per-page one, FindImages with SetFindImagesMode scans the whole file.
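Deduplicating by content hash can be sketched independently of the PDF library. The example below assumes Delphi XE8 or later for System.Hash. Sha256Hex and FirstSeenAt are hypothetical helper names, and the byte arrays are stand-ins for the contents of exported image files.

```pascal
program DedupeSketch;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Hash, System.Generics.Collections;

// Hex SHA-256 digest of a byte buffer.
function Sha256Hex(const Data: TBytes): string;
var
  H: THashSHA2;
begin
  H := THashSHA2.Create; // defaults to the SHA-256 variant
  H.Update(Data);
  Result := H.HashAsString;
end;

// Returns -1 when Data is new (and records it under Index); otherwise
// returns the index of the earlier item with identical bytes.
function FirstSeenAt(Seen: TDictionary<string, Integer>;
  const Data: TBytes; Index: Integer): Integer;
var
  Key: string;
  Prior: Integer;
begin
  Key := Sha256Hex(Data);
  if Seen.TryGetValue(Key, Prior) then
    Exit(Prior);
  Seen.Add(Key, Index);
  Result := -1;
end;

var
  Seen: TDictionary<string, Integer>;
  Items: array of TBytes;
  I, Prior: Integer;
begin
  // Stand-ins for the bytes of three exported image files.
  SetLength(Items, 3);
  Items[0] := TBytes.Create(1, 2, 3);
  Items[1] := TBytes.Create(9, 9);
  Items[2] := TBytes.Create(1, 2, 3);

  Seen := TDictionary<string, Integer>.Create;
  try
    for I := 0 to High(Items) do
    begin
      Prior := FirstSeenAt(Seen, Items[I], I);
      if Prior < 0 then
        Writeln(Format('item %d: keep', [I]))
      else
        Writeln(Format('item %d: duplicate of item %d', [I, Prior]));
    end;
  finally
    Seen.Free;
  end;
end.
```

In a real export the buffers would come from the saved files, for example through TFile.ReadAllBytes in System.IOUtils. A byte-level hash only catches exact duplicates, so the same picture stored with different compression still counts as two items.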
One boundary to communicate to stakeholders early: image extraction returns raster resources. Logos and diagrams drawn as vector paths are not images in the resource sense and will never appear in any image list — when the requirement is to deliver the chart as a picture, the honest implementation renders the page region to a bitmap instead, and the two kinds of output should not share an export folder without labeling.
Fonts: an audit surface, not an export feature
The font API answers questions about fonts; it does not hand over font files. After FindFonts scans the document, the enumeration walks by ID and the property calls describe whichever font is currently selected:
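// Requires SysUtils (Format); Pdf is an already loaded document.
procedure ListFonts(Pdf: TPDFlib);
var
  I: Integer;
begin
  Pdf.FindFonts;
  for I := 1 to Pdf.FontCount do // font indexes start at 1, not 0
    if Pdf.SelectFont(Pdf.GetFontID(I)) = 1 then
      // The property calls describe whichever font SelectFont just selected.
      Writeln(Format('%s type=%d embedded=%d subset=%d',
        [Pdf.FontName, Pdf.FontType,
         Pdf.GetFontIsEmbedded, Pdf.GetFontIsSubsetted]));
end;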
Note the loop bounds: font indexes run from 1 to FontCount, unlike the zero-based text-block and image-list indexes a few paragraphs up — mixing the two conventions produces an off-by-one that skips the first font or reads past the last. And to be precise about scope: there is no byte-level font export in this API. No call returns the embedded font program as a TTF or OTF file; enumeration plus metadata inspection is the intended model. For production purposes that model covers what actually matters — subset detection by name pattern, embedding audits ahead of archival conversion (an unembedded font is a hard PDF/A blocker, as covered in PDF/A and PDF/UA preflight in Delphi), and encoding diagnostics when extraction confidence drops. Subset font programs are licensed material and incomplete as installable fonts anyway; treating them as audit metadata rather than extractable assets is the defensible position.
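Subset detection by name pattern relies on the convention in ISO 32000-1 §9.6.4: a subset font's name carries a tag of six uppercase letters followed by a plus sign, as in EOODIA+Poetica. A minimal sketch of that check follows. HasSubsetTag and StripSubsetTag are hypothetical helper names, and the sketch assumes the font name the library reports still includes the tag.

```pascal
program SubsetTagSketch;
{$APPTYPE CONSOLE}

// True when Name starts with six uppercase ASCII letters followed by '+'
// and has at least one character of base name after the tag.
function HasSubsetTag(const Name: string): Boolean;
var
  I: Integer;
begin
  Result := False;
  if (Length(Name) < 8) or (Name[7] <> '+') then
    Exit;
  for I := 1 to 6 do
    if (Name[I] < 'A') or (Name[I] > 'Z') then
      Exit;
  Result := True;
end;

// Base font name with the subset tag removed, or Name unchanged if untagged.
function StripSubsetTag(const Name: string): string;
begin
  if HasSubsetTag(Name) then
    Result := Copy(Name, 8, MaxInt)
  else
    Result := Name;
end;

begin
  Writeln(HasSubsetTag('EOODIA+Poetica'));
  Writeln(HasSubsetTag('Helvetica-Bold'));
  Writeln(StripSubsetTag('EOODIA+Poetica'));
end.
```

Grouping fonts by the stripped name shows when several subsets of the same typeface are embedded, which is a common finding in merged documents.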
The encoding probe earns its place in triage pipelines: GetFontEncoding on each font, combined with the subset flag, predicts extraction quality before any text is pulled — a page whose fonts are all subsetted with non-standard encodings is an OCR candidate by inspection alone.
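The triage rule in that sentence reduces to a small predicate once the per-font facts are collected. The sketch below is illustrative only. TFontFacts and IsOcrCandidate are hypothetical names, and how a GetFontEncoding result maps to the StandardEncoding flag is an assumption left to the API reference.

```pascal
program FontTriageSketch;
{$APPTYPE CONSOLE}

type
  // Facts gathered per font during enumeration. Deriving StandardEncoding
  // from the library's encoding value is assumed, not shown.
  TFontFacts = record
    Name: string;
    Subsetted: Boolean;
    StandardEncoding: Boolean;
  end;

function MakeFont(const Name: string; Subsetted, StandardEncoding: Boolean): TFontFacts;
begin
  Result.Name := Name;
  Result.Subsetted := Subsetted;
  Result.StandardEncoding := StandardEncoding;
end;

// True when every font is subsetted and uses a non-standard encoding.
// An empty list returns False: a page with no fonts is a different signal.
function IsOcrCandidate(const Fonts: array of TFontFacts): Boolean;
var
  I: Integer;
begin
  Result := Length(Fonts) > 0;
  for I := 0 to High(Fonts) do
    if (not Fonts[I].Subsetted) or Fonts[I].StandardEncoding then
    begin
      Result := False;
      Exit;
    end;
end;

var
  Risky, Mixed: array of TFontFacts;
begin
  SetLength(Risky, 2);
  Risky[0] := MakeFont('ABCDEF+Body', True, False);
  Risky[1] := MakeFont('GHIJKL+Head', True, False);

  SetLength(Mixed, 2);
  Mixed[0] := MakeFont('ABCDEF+Body', True, False);
  Mixed[1] := MakeFont('Helvetica', False, True);

  Writeln('risky page is OCR candidate: ', IsOcrCandidate(Risky));
  Writeln('mixed page is OCR candidate: ', IsOcrCandidate(Mixed));
end.
```

The predicate is deliberately strict: one font with a standard encoding is enough to attempt text extraction first and fall back to OCR only if the output looks wrong.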
Extraction at scale without loading documents
For batch pipelines, loading a whole document to read one page is wasted I/O. The single-call variants — ExtractFilePageText and ExtractFilePageTextBlocks — take a file name, password, and page number directly. For gigabyte-scale files there is a lower gear: the direct-access path opens the file through streaming xref reads, so DAOpenFileReadOnly followed by DAExtractPageText touches only the objects one page needs. One convention shift to respect: DA functions address pages by PageRef, an object-reference handle obtained from DAFindPage, not by page number — passing the number where the handle belongs operates on the wrong object without raising an error. The wider direct-access toolkit is mapped out in large PDF merge, split, and direct access.
Common extraction questions
Why is extracted text different from what the viewer shows?
Usually encoding: a ligature is a single glyph that may decode to a presentation-form code point such as U+FB01 rather than the letters "fi", and subsetted fonts with incomplete ToUnicode maps yield substitutions or gaps. Compare the share of unmapped or replacement characters across pages, and treat pages dominated by them as OCR candidates.
Can PDFlibPas save an embedded font as a TTF or OTF file?
No. The font API enumerates and inspects — name, family, type, encoding, embedded and subset status — and that is its full scope. Plan font workflows around audit questions rather than file export.
How do I extract text from only one region of the page?
SetTextExtractionArea restricts subsequent extractions to a rectangle. Remember that it persists on the document: reset it once the regional task is done, or the next full-page extraction silently returns the region only.
Evaluation builds, demo projects, and the complete extraction API reference are on the losLab PDF Library for Delphi product page.