Point NVDA at a freshly built Delphi PDF viewer and you usually get one of two outcomes: silence, or text spoken in whatever order the content stream happens to store it — page footer first, then the right-hand column, then the heading that visually opens the page. The rendering is flawless; the listening experience is useless. The gap exists because rasterization and reading are separate pipelines: paint order inside a PDF content stream carries no obligation to match the order a human should hear. PDFium Component, the VCL/LCL wrapper around the PDFium engine for Delphi, C++Builder, and Lazarus, ships a dedicated family of reading APIs precisely because the rendering APIs cannot do this job.
Three problems decide whether an accessible reader project succeeds: extracting a speakable reading order, keeping a visible word cursor synchronized with speech output, and degrading honestly when the document was never tagged. Each has a concrete API path and an equally concrete failure mode worth knowing before you write the first event handler.
Reading order lives in the structure tree, not the paint order
ISO 32000-1 defines logical structure (§14.7) as a tree of structure elements layered over the page content, and Tagged PDF (§14.8) adds the conventions that make that tree usable for reading order. PDF/UA (ISO 14289-1) makes the tree mandatory: every piece of real content must be reachable through it in reading order, with artifacts excluded. A correctly tagged report knows that "Quarterly Results" is a level-two heading and that the totals grid is a table with header cells. An untagged report is positioned glyph runs and nothing more.
ReadablePageContent walks this structure when it exists and returns content fragments labeled with a semantic Kind — cfHeading, cfParagraph, and related values — so the reader UI can announce "heading" before the text instead of reading a bold line as if it were body copy. When the structure tree is absent or unusable, the same call switches to heuristic layout analysis: column detection, baseline clustering, left-to-right ordering. The output is often serviceable for single-column documents and unreliable for newsletters, multi-column forms, and anything with sidebars. The crucial discipline is to tell the user which case they are in — and the API hands you that fact directly: the returned TPdfReadableContent record carries a Source field that is rosStructure when the order came from the tagged tree and rosHeuristic when it was guessed from layout. Presenting guessed order as verified order is the accessibility equivalent of a green checkmark on an untested build.
A practical way to classify a file at open time is to check IsTagged and run ValidatePdfUa once, caching the verdict. A failed PDF/UA check does not mean rejecting the document — it means the status bar shows "estimated reading order", and your support team knows exactly what they are looking at when a customer reports nonsense narration on a specific file.
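The open-time policy can be reduced to one small decision function. The sketch below is self-contained and does not call the component: TOrderSource is a local stand-in that borrows the two Source values named above, and TOpenVerdict, PdfUaPassed, and StatusBarText are hypothetical names introduced for illustration. How PdfUaPassed is derived from the ValidatePdfUa result is left as an assumption.

```pascal
program ReadingOrderStatus;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  // Local stand-in mirroring the two Source values named in the text.
  TOrderSource = (rosStructure, rosHeuristic);

  // Hypothetical verdict, computed once when the file is opened and then reused.
  TOpenVerdict = record
    Source: TOrderSource;
    PdfUaPassed: Boolean; // assumed to be derived from the ValidatePdfUa result
  end;

// Only structure-tree order from a file that passed the check is presented
// as tagged; everything else is labeled as estimated.
function StatusBarText(const Verdict: TOpenVerdict): string;
begin
  if (Verdict.Source = rosStructure) and Verdict.PdfUaPassed then
    Result := 'Tagged reading order'
  else
    Result := 'Estimated reading order';
end;

var
  V: TOpenVerdict;
begin
  V.Source := rosStructure;
  V.PdfUaPassed := False; // tagged, but the PDF/UA check failed
  WriteLn(StatusBarText(V));
end.
```

Keeping the decision in one function means the status bar and the diagnostics log cannot disagree about which case a file is in.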
From page to speech queue with ReadingUnits
For text-to-speech the workhorse is ReadingUnits: it returns an array of TPdfReadingUnit records for the active page, each carrying the text to speak, its semantic role, and the highlight rectangles that locate it on the page. (A document-wide variant, DocumentReadingUnits, exists for continuous reading.) One unit maps naturally onto one entry in a speech queue:
procedure TReaderForm.QueuePageSpeech(PageNumber: Integer);
var
  Units: TPdfReadingUnits;
  i: Integer;
begin
  Pdf.PageNumber := PageNumber; // ReadingUnits works on the active page
  Units := Pdf.ReadingUnits;
  FSpeechQueue.Clear;
  for i := Low(Units) to High(Units) do
    FSpeechQueue.Add(Units[i]); // text + semantics + highlight rects
  FCurrentPage := PageNumber;
  SpeakNextUnit;
end;
Two details in this loop repay attention. First, keep the queue strictly per page and rebuild it on navigation: reading units hold page-space rectangles, so a stale queue paints highlights onto the wrong page after the user jumps ahead. Second, an empty Units array on a page that visibly contains content is your image-only detector. A scanned page has pixels but no text layer, and the correct response is a spoken warning — "this page contains no extractable text" — rather than silence that the user cannot distinguish from a crash.
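The image-only check can be written as a small guard that decides what the queue says first. This is a minimal sketch, not the component's API: FirstUtterance and NoTextWarning are hypothetical names, and the open array of strings stands in for the text of the page's reading units. Treating whitespace-only units as empty is an added assumption.

```pascal
program EmptyPageWarning;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  SysUtils;

const
  NoTextWarning = 'This page contains no extractable text.';

// Returns the first thing the speech queue should say for a page.
// UnitTexts stands in for the text carried by the page's reading units.
function FirstUtterance(const UnitTexts: array of string): string;
var
  i: Integer;
begin
  for i := Low(UnitTexts) to High(UnitTexts) do
    if Trim(UnitTexts[i]) <> '' then
    begin
      Result := UnitTexts[i];
      Exit;
    end;
  // No units, or only whitespace (assumption): treat the page as image-only.
  Result := NoTextWarning;
end;

begin
  WriteLn(FirstUtterance([]));
  WriteLn(FirstUtterance(['Quarterly Results', 'Revenue rose.']));
end.
```

In a real handler the warning would go through the same speech path as ordinary units, so the user hears it in the same voice and at the same rate as the rest of the document.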
A word cursor that follows the voice
Block-level highlighting feels sluggish to low-vision users who follow along visually while listening. Word-level ("karaoke") highlighting needs two ingredients: word geometry, and a mapping from the TTS engine's progress callbacks onto that geometry. PageWordBoxes supplies the geometry as TPdfWordBox records — word text, character offset, character count, and a page-space rectangle. TrackReadingWordAt supplies the mapping: it converts a character position, which is exactly what SAPI's word-boundary notification hands you, into an index into the word-box array — and highlights the word containing it in the same call.
procedure TReaderForm.PrepareKaraoke(PageNumber: Integer);
begin
  // The view's word boxes come from the page the view displays;
  // setting Pdf.PageNumber alone would not move the view
  PdfView.PageNumber := PageNumber;
  FWordBoxes := PdfView.PageWordBoxes;
end;

procedure TReaderForm.OnTtsWordBoundary(Sender: TObject; CharIndex: Integer);
var
  WordIdx: Integer;
begin
  // TrackReadingWordAt maps the offset AND paints the word cursor
  WordIdx := PdfView.TrackReadingWordAt(FCurrentPage, CharIndex);
  if WordIdx < 0 then
    PdfView.ClearReadingWord; // boundary ran past the page text
end;
The contract is forgiving in two ways and strict in one. Forgiving: TrackReadingWordAt maintains its own word-box cache for the tracked page, so you do not have to pre-feed it, and no rendering is involved, because word boxes derive from the page's text layer; even a headless speech service can track positions. Also forgiving: the function returns -1 instead of raising when CharIndex points past the end of the page text, which happens routinely when a TTS engine fires a final boundary event for trailing punctuation. Treat -1 as "clear the cursor", not as an error condition. Strict: the character index must refer to the text the component extracted, not to a copy you modified before handing it to the TTS engine.
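If the text is cleaned up before it reaches the TTS engine, one way to keep offsets honest is to record where each cleaned character came from and translate the engine's offset back before tracking. The sketch below is self-contained and uses no component calls: CollapseWhitespace, ToRawOffset, and TOffsetMap are hypothetical helpers, whitespace collapsing is only an example of preprocessing, and 0-based offsets on both sides are an assumption to check against your TTS engine and the API reference.

```pascal
program OffsetMapSketch;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  TOffsetMap = array of Integer;

// Collapses each run of whitespace/control characters into one space and
// records, for every character of the cleaned string, the 0-based offset
// it came from in Raw.
function CollapseWhitespace(const Raw: string; out Map: TOffsetMap): string;
var
  i, n: Integer;
  PrevSpace: Boolean;
begin
  SetLength(Result, Length(Raw));
  SetLength(Map, Length(Raw));
  n := 0;
  PrevSpace := False;
  for i := 1 to Length(Raw) do
  begin
    if Raw[i] <= ' ' then
    begin
      if PrevSpace then
        Continue; // swallow the rest of the whitespace run
      PrevSpace := True;
      Inc(n);
      Result[n] := ' ';
    end
    else
    begin
      PrevSpace := False;
      Inc(n);
      Result[n] := Raw[i];
    end;
    Map[n - 1] := i - 1; // cleaned offset -> raw offset, both 0-based
  end;
  SetLength(Result, n);
  SetLength(Map, n);
end;

// Translates an offset in the cleaned text back to the raw text.
// Returns -1 when the offset lies outside the cleaned text.
function ToRawOffset(const Map: TOffsetMap; CleanIndex: Integer): Integer;
begin
  if (CleanIndex >= 0) and (CleanIndex < Length(Map)) then
    Result := Map[CleanIndex]
  else
    Result := -1;
end;

var
  Map: TOffsetMap;
  Clean: string;
begin
  Clean := CollapseWhitespace('Total   revenue'#13#10'rose', Map);
  WriteLn(Clean);
  WriteLn(ToRawOffset(Map, 6)); // raw offset of the "r" in "revenue"
end.
```

In the word-boundary handler, the value passed on for tracking would then be the translated raw offset rather than the engine's own CharIndex.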
On the display side, ReadingWordColor controls the cursor highlight — the default amber survives most page backgrounds, but check it under every display filter your viewer offers, because an amber cursor can disappear entirely under color inversion, and inversion plus speech is precisely the combination low-vision users run. Setting ReadingWordFollow to True makes the view scroll the spoken word into visibility automatically, which is essential on zoomed multi-screen pages. One scope rule: SetReadingWord paints on the active TPdfView page only, so decide whether user scrolling pauses speech or the follow behavior wins; doing neither leaves speech running against an invisible cursor.
Documents that fight back
Three input classes break naive implementations often enough to deserve permanent regression samples in the test suite.
- Untagged but text-rich files. Heuristic order is usually right for linear reports and wrong for layouts with sidebars or pull quotes. Label the order as estimated in the UI and in your diagnostics log.
- Image-only scans. No text layer at all. Detect them through empty reading units and route the user toward an OCR step upstream instead of letting the reader speak nothing.
- Combining characters and mixed scripts. Unicode combining marks do not always map one-to-one onto visual words, so the word-box count can differ from what your own tokenizer would predict. Never index the word-box array with arithmetic derived from your own splitting; use only indices returned by TrackReadingWordAt.
Acceptance: test like an auditor, not like a demo
"It read my sample aloud" is not acceptance. A defensible pass runs three documents through the finished build with NVDA attached: a known-tagged file (headings announced as headings, table read in row order), a known-untagged file (the estimated-order indicator visible), and a scan (explicit no-text warning spoken).
Then verify the word cursor stays attached at double and half speech rate, and that ReadingWordFollow scrolling does not fight manual scrolling. Finally, toggle every color filter while speech runs and confirm the cursor remains visible — the low-vision color filter article covers that rendering path, and the word speech cursor deep dive goes further into TTS timing details.
FAQ
Does the reader require a tagged PDF to work at all?
No. ReadablePageContent and ReadingUnits fall back to heuristic layout analysis on untagged files, and the Source field of the readable content tells you which path produced the order. The burden falls on your UI: distinguish verified structure-tree order from estimated order, because the two fail in different ways and support needs to know which one a complaint is about.
Why does TrackReadingWordAt return -1 in the middle of a page?
Usually the character index from your TTS engine refers to text you preprocessed before queuing, or it landed on inter-word whitespace. Offsets must point into the text the component extracted — the same text PageWordBoxes tokenized — not into a cleaned-up copy of it.
Can I check accessibility compliance programmatically?
Yes — ValidatePdfUa returns the detected conformance level plus a set of PDF/UA violations per document, and BuildPdfPreflightReport folds the same check into a multi-standard report. It is a detector, not a repair tool: use the verdict to set user expectations at open time and to triage incoming files.
The reading-unit and word-box APIs shown here are part of PDFium Component for Delphi and C++Builder (VCL) and Lazarus/FPC (LCL). The product page links the full API reference, including the record layouts for reading units and word boxes used in the examples above.