Technical Article

Build an Accessible PDF Reader in Delphi with PDFium

Point NVDA at a freshly built Delphi PDF viewer and you typically get one of two outcomes: silence, or text spoken in whatever order the content stream happens to store it — page footer first, then the right-hand column, then the heading that visually opens the page. The rendering is flawless; the listening experience is useless. The gap exists because rasterization and reading are separate pipelines: paint order within a PDF content stream carries no obligation to match the order a human should hear. PDFium Component, the VCL/LCL wrapper around the PDFium engine for Delphi, C++Builder, and Lazarus, ships a dedicated family of reading APIs precisely because the rendering APIs cannot do this job.

UK teams should align this accessible-reader workflow with local governance, audit, and data quality requirements before production release.

Three problems decide whether an accessible reader project succeeds: extracting a speakable reading order, keeping a visible word cursor synchronised with speech output, and degrading honestly when the document was never tagged. Each has a concrete API path and an equally concrete failure mode worth knowing before you write the first event handler.

Reading order lives in the structure tree, not the paint order

ISO 32000-1 §14.8 defines logical structure as a tree of structure elements layered over the page content, and PDF/UA (ISO 14289-1) makes that tree mandatory: every piece of real content must be reachable through it in reading order, with artefacts excluded. A correctly tagged report knows that 'Quarterly Results' is a level-two heading and that the totals grid is a table with header cells. An untagged report is positioned glyph runs and nothing more.

ReadablePageContent walks this structure when it exists and returns content fragments labelled with a semantic Kind (cfHeading, cfParagraph, and related values), so the reader UI can announce 'heading' before the text instead of reading a bold line as if it were body copy. When the structure tree is absent or unusable, the same call switches to heuristic layout analysis: column detection, baseline clustering, left-to-right ordering. The output is often serviceable for single-column documents and unreliable for newsletters, multi-column forms, and anything with sidebars. The crucial discipline is to tell the user which case they are in, and the API hands you that fact directly: the returned TPdfReadableContent record carries a Source field that is rosStructure when the order came from the tagged tree and rosHeuristic when it was guessed from layout. Presenting guessed order as verified order is the accessibility equivalent of a green tick on an untested build.

A practical way to classify a file at open time is to check IsTagged and run ValidatePdfUa once, caching the verdict. A failed PDF/UA check does not mean rejecting the document — it means the status bar shows 'estimated reading order', and your support team knows exactly what they are looking at when a customer reports nonsense narration on a specific file.
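
The open-time classification can be reduced to a small pure function that is easy to unit-test. The sketch below assumes the two facts have already been read from the component (IsTagged and the outcome of ValidatePdfUa) and are passed in as plain Booleans. TOrderConfidence, ClassifyReadingOrder, and StatusBarText are hypothetical names, not part of the component's API, and the 'verified reading order' label is an assumed counterpart to the 'estimated reading order' text mentioned above.

```pascal
program ClassifyOrderDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical verdict type, computed once per opened document and cached
  TOrderConfidence = (ocVerified, ocEstimated);

// Tagged AND passing the PDF/UA check counts as verified;
// every other file is still read aloud, but labelled as estimated.
function ClassifyReadingOrder(IsTagged, PassedPdfUa: Boolean): TOrderConfidence;
begin
  if IsTagged and PassedPdfUa then
    Result := ocVerified
  else
    Result := ocEstimated;
end;

// Text for the status bar and the diagnostics log
function StatusBarText(Verdict: TOrderConfidence): string;
begin
  case Verdict of
    ocVerified: Result := 'verified reading order';
  else
    Result := 'estimated reading order';
  end;
end;

begin
  WriteLn(StatusBarText(ClassifyReadingOrder(True, True)));
  WriteLn(StatusBarText(ClassifyReadingOrder(True, False)));
  WriteLn(StatusBarText(ClassifyReadingOrder(False, False)));
end.
```

Keeping the verdict separate from the component calls means the same function can drive both the status bar and the support log without re-running validation.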

From page to speech queue with ReadingUnits

For text-to-speech the workhorse is ReadingUnits: it returns an array of TPdfReadingUnit records for the active page, each carrying the text to speak, its semantic role, and the highlight rectangles that locate it on the page. (A document-wide variant, DocumentReadingUnits, exists for continuous reading.) One unit maps naturally onto one entry in a speech queue:

procedure TReaderForm.QueuePageSpeech(PageNumber: Integer);
var
  Units: TPdfReadingUnits;
  i: Integer;
begin
  Pdf.PageNumber := PageNumber;   // ReadingUnits works on the active page
  Units := Pdf.ReadingUnits;
  FSpeechQueue.Clear;
  for i := Low(Units) to High(Units) do
    FSpeechQueue.Add(Units[i]);  // text + semantics + highlight rects
  FCurrentPage := PageNumber;
  SpeakNextUnit;
end;

Two details in this loop repay attention. Firstly, keep the queue strictly per page and rebuild it on navigation: reading units hold page-space rectangles, so a stale queue paints highlights onto the wrong page after the user jumps ahead. Secondly, an empty Units array on a page that visibly contains content is your image-only detector. A scanned page has pixels but no text layer, and the correct response is a spoken warning — 'this page contains no extractable text' — instead of silence that the user cannot distinguish from a crash.
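
The empty-page rule can be sketched independently of the component. The example below is a minimal illustration that stands in plain strings for the text of each reading unit (the real records also carry semantics and highlight rectangles); FillSpeechQueue and NoTextWarning are hypothetical names chosen for this sketch.

```pascal
program EmptyPageWarningDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

const
  // Hypothetical wording for the spoken warning
  NoTextWarning = 'This page contains no extractable text.';

// Rebuilds the per-page queue from the unit texts. When the page
// yields no units at all, a warning is queued instead of silence.
procedure FillSpeechQueue(const UnitTexts: array of string; Queue: TStrings);
var
  i: Integer;
begin
  Queue.Clear;                        // the queue is strictly per page
  for i := Low(UnitTexts) to High(UnitTexts) do
    Queue.Add(UnitTexts[i]);
  if Queue.Count = 0 then
    Queue.Add(NoTextWarning);         // image-only page detected
end;

var
  Queue: TStringList;
  Empty: array of string;             // simulates a scanned page
begin
  Queue := TStringList.Create;
  try
    FillSpeechQueue(Empty, Queue);
    WriteLn(Queue[0]);
    FillSpeechQueue(['Quarterly Results', 'Revenue rose.'], Queue);
    WriteLn(Queue.Count);
  finally
    Queue.Free;
  end;
end.
```

Routing the warning through the same queue as ordinary text means the user hears it through the voice they already have configured, rather than through a separate message box a screen reader may not announce.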

A word cursor that follows the voice

Block-level highlighting feels sluggish to low-vision users who follow along visually whilst listening. Word-level ('karaoke') highlighting needs two ingredients: word geometry, and a mapping from the TTS engine's progress callbacks onto that geometry. PageWordBoxes supplies the geometry as TPdfWordBox records — word text, character offset, character count, and a page-space rectangle. TrackReadingWordAt supplies the mapping: it converts a character position, which is exactly what SAPI's word-boundary notification hands you, into an index into the word-box array — and highlights the word containing it in the same call.

procedure TReaderForm.PrepareKaraoke(PageNumber: Integer);
begin
  // The view's word boxes come from the page the view displays —
  // setting Pdf.PageNumber alone would not move the view
  PdfView.PageNumber := PageNumber;
  FWordBoxes := PdfView.PageWordBoxes;
end;

procedure TReaderForm.OnTtsWordBoundary(Sender: TObject; CharIndex: Integer);
var
  WordIdx: Integer;
begin
  // TrackReadingWordAt maps the offset AND paints the word cursor
  WordIdx := PdfView.TrackReadingWordAt(FCurrentPage, CharIndex);
  if WordIdx < 0 then
    PdfView.ClearReadingWord;  // boundary ran past the page text
end;

The contract is forgiving in one way and strict in another. Forgiving: TrackReadingWordAt maintains its own word-box cache for the tracked page, so you do not have to pre-feed it — and no rendering is involved, because word boxes derive from the page's text layer, which means even a headless speech service can track positions. Strict: the character index must refer to the text the component extracted. The function also returns -1 instead of raising when CharIndex points past the end of the page text, which happens routinely when a TTS engine fires a final boundary event for trailing punctuation. Treat -1 as 'clear the cursor', not as an error condition.
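
To make the -1 contract concrete, here is a minimal sketch of the kind of lookup such a mapping performs. It is not the component's implementation: TWordSpan and WordIndexAt are hypothetical names, the field names are assumptions modelled on the offset and count described above, and the binary search assumes the spans are sorted by offset and do not overlap.

```pascal
program WordIndexDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for a word box; the page-space rectangle
  // is omitted because the lookup does not need it.
  TWordSpan = record
    CharOffset: Integer;  // index of the word's first character
    CharCount: Integer;   // number of characters in the word
  end;

// Returns the index of the span containing CharIndex, or -1 when the
// index falls on whitespace between words or past the last word.
function WordIndexAt(const Spans: array of TWordSpan; CharIndex: Integer): Integer;
var
  Left, Right, Middle: Integer;
begin
  Result := -1;
  Left := 0;
  Right := High(Spans);
  while Left <= Right do
  begin
    Middle := (Left + Right) div 2;
    if CharIndex < Spans[Middle].CharOffset then
      Right := Middle - 1
    else if CharIndex >= Spans[Middle].CharOffset + Spans[Middle].CharCount then
      Left := Middle + 1
    else
    begin
      Result := Middle;
      Exit;
    end;
  end;
end;

var
  Spans: array[0..1] of TWordSpan;
begin
  // 'Quarterly Results': words at offsets 0..8 and 10..16, a space at 9
  Spans[0].CharOffset := 0;  Spans[0].CharCount := 9;
  Spans[1].CharOffset := 10; Spans[1].CharCount := 7;
  WriteLn(WordIndexAt(Spans, 4));   // 0: inside 'Quarterly'
  WriteLn(WordIndexAt(Spans, 9));   // -1: the space between words
  WriteLn(WordIndexAt(Spans, 12));  // 1: inside 'Results'
  WriteLn(WordIndexAt(Spans, 17));  // -1: past the end of the text
end.
```

The two -1 cases are exactly the ones a TTS engine produces in practice: a boundary event that lands on whitespace, and a trailing event after the last word.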

On the display side, ReadingWordColor controls the cursor highlight. The default amber survives most page backgrounds, but check it under every display filter your viewer offers, because an amber cursor can disappear entirely under colour inversion, and inversion plus speech is precisely the combination low-vision users run. Setting ReadingWordFollow to True makes the view scroll the spoken word into view automatically, which is essential when a zoomed page spans several screens. One scope rule: SetReadingWord paints on the active TPdfView page only, so decide up front whether user scrolling pauses speech or the follow behaviour wins; doing neither leaves speech running against an invisible cursor.

Documents that fight back

Three input classes break naive implementations often enough to deserve permanent regression samples in the test suite.

  • Untagged but text-rich files. Heuristic order is typically right for linear reports and wrong for layouts with sidebars or pull quotes. Label the order as estimated in the UI and in your diagnostics log.
  • Image-only scans. No text layer at all. Detect them through empty reading units and route the user towards an OCR step upstream instead of letting the reader speak nothing.
  • Combining characters and mixed scripts. Unicode combining marks do not always map one-to-one onto visual words, so the word-box count can differ from what your own tokenizer would predict. Never index the word-box array with arithmetic derived from your own splitting; use only indices returned by TrackReadingWordAt.

Acceptance: test like an auditor, not like a demo

'It read my example aloud' is not acceptance. A defensible pass runs three documents through the finished build with NVDA attached: a known-tagged file (headings announced as headings, table read in row order), a known-untagged file (the estimated-order indicator visible), and a scan (explicit no-text warning spoken).

Then verify the word cursor stays attached at double and half speech rate, and that ReadingWordFollow scrolling does not fight manual scrolling. Finally, toggle every colour filter whilst speech runs and confirm the cursor remains visible — the low-vision colour filter article covers that rendering path, and the word speech cursor deep dive goes further into TTS timing details.

FAQ

Does the reader require a tagged PDF to work at all?

No. ReadablePageContent and ReadingUnits fall back to heuristic layout analysis on untagged files, and the Source field of the readable content tells you which path produced the order. The burden falls on your UI: distinguish verified structure-tree order from estimated order, because the two fail in different ways and support needs to know which one a complaint is about.

Why does TrackReadingWordAt return -1 in the middle of a page?

Typically the character index from your TTS engine refers to text you preprocessed before queuing, or it landed on inter-word whitespace. Offsets must point into the text the component extracted — the same text PageWordBoxes tokenized — not into a cleaned-up copy of it.
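
A small, component-independent illustration of how preprocessing shifts offsets: the sketch below collapses repeated spaces (a common clean-up step before queuing text for speech) and shows that the same word then sits at a different character position. CollapseSpaces is a hypothetical helper written for this example, and the sample string is invented.

```pascal
program OffsetDriftDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical clean-up step: replaces each run of spaces with one space
function CollapseSpaces(const S: string): string;
var
  i: Integer;
  PrevSpace: Boolean;
begin
  Result := '';
  PrevSpace := False;
  for i := 1 to Length(S) do
  begin
    if (S[i] = ' ') and PrevSpace then
      Continue;                       // drop the repeated space
    PrevSpace := S[i] = ' ';
    Result := Result + S[i];
  end;
end;

var
  Extracted, Cleaned: string;
begin
  Extracted := 'Quarterly   Results';  // three spaces, as extracted
  Cleaned := CollapseSpaces(Extracted);
  // Pos is 1-based; subtract 1 to get a zero-based character offset
  WriteLn(Pos('Results', Extracted) - 1);  // 12
  WriteLn(Pos('Results', Cleaned) - 1);    // 10
end.
```

A boundary event reporting offset 10 is correct for the cleaned string but points into whitespace in the extracted text, which is one way a mid-page -1 arises.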

Can I check accessibility compliance programmatically?

Yes — ValidatePdfUa returns the detected conformance level plus a set of PDF/UA violations per document, and BuildPdfPreflightReport folds the same check into a multi-standard report. It is a detector, not a repair tool: use the verdict to set user expectations at open time and to triage incoming files.

The reading-unit and word-box APIs shown here are part of PDFium Component for Delphi and C++Builder (VCL) and Lazarus/FPC (LCL). The product page links the full API reference, including the record layouts for reading units and word boxes used in the examples above.