The first demo of our read-aloud feature for a literacy app went fine for two paragraphs. Then the page hit a drop cap, the voice said "Chapter" while the highlight sat on the previous line, and by the bottom of the page the cursor trailed the audio by three words. The voice was never the problem — SAPI reported word boundaries precisely. The problem was the mapping layer between character offsets in the speech buffer and rectangles on a rendered PDF page, and that layer is where every karaoke-style highlighter lives or dies. PDFium Component (word boxes since v1.53, the tracker and highlight cursor since v1.56) ships that mapping for Delphi, C++Builder, and Lazarus as a small, deliberate API: word boxes, an offset-to-word tracker, and a highlight cursor with auto-scroll. Used in the right order it is robust; used in the wrong order it produces exactly the drift we demoed.
Characters are not words, and TTS engines speak in characters
A speech engine consumes a flat string and reports progress as character positions within that string. A PDF page, meanwhile, has glyphs placed in page space, where a "word" is a heuristic cluster of glyph runs. The two coordinate systems share nothing unless the text you hand the synthesizer is, character for character, the text the word boxes were computed from. That is rule one, and it is unforgiving: normalize whitespace, strip soft hyphens, or otherwise "clean up" the extracted text before speaking it, and every offset after the first change is silently wrong. Speak exactly what you extracted, or maintain an explicit offset remapping table; there is no third option that survives real documents.
The remapping option is not hypothetical. If your UI inserts spoken page announcements ("page five") or expands abbreviations for the synthesizer, record each insertion's position and length, and subtract the accumulated adjustment before every tracking call. It is twenty lines of bookkeeping, and it is the difference between a highlight that survives feature growth and one that breaks the first time product asks for spoken headings.
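The bookkeeping can be sketched as follows. This is a minimal illustration, not part of the component's API: TInsertion and SpokenToPageOffset are hypothetical names, offsets are assumed to be 0-based, and only pure insertions are handled (an abbreviation expansion that replaces text would also need the length of the text it replaced).

```pascal
program OffsetRemapSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

type
  // One run of text present in the spoken string but absent from the page text.
  TInsertion = record
    SpokenPos: Integer; // 0-based offset in the spoken string where the run starts
    Len: Integer;       // number of characters inserted
  end;

// Converts a 0-based offset in the spoken string to an offset in the page text.
// Returns -1 when the offset falls inside inserted text (nothing to highlight).
// Ins must be sorted by SpokenPos.
function SpokenToPageOffset(const Ins: array of TInsertion;
  SpokenOffset: Integer): Integer;
var
  I, Shift: Integer;
begin
  Shift := 0;
  for I := 0 to High(Ins) do
  begin
    if SpokenOffset < Ins[I].SpokenPos then
      Break; // this and all later insertions start after the offset
    if SpokenOffset < Ins[I].SpokenPos + Ins[I].Len then
    begin
      Result := -1;
      Exit;
    end;
    Inc(Shift, Ins[I].Len); // insertion lies wholly before the offset
  end;
  Result := SpokenOffset - Shift;
end;

var
  Ins: array[0..0] of TInsertion;
begin
  // Spoken: 'Page five. Hello world'   Page text: 'Hello world'
  Ins[0].SpokenPos := 0;
  Ins[0].Len := 11; // Length('Page five. ')
  WriteLn(SpokenToPageOffset(Ins, 4));  // -1: inside the announcement
  WriteLn(SpokenToPageOffset(Ins, 11)); // 0: 'H' of Hello
  WriteLn(SpokenToPageOffset(Ins, 17)); // 6: 'w' of world
end.
```

The result of the conversion is what goes to the tracking call; a −1 here means the voice is speaking something that has no place on the page, so the previous highlight should simply stay put.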
What a word box gives you
Each TPdfWordBox record carries the word's text, its StartIndex and character Count within the page text, a page-space Rect, and the 1-based Page number. PageWordBoxes returns the full array for the active page:
procedure TReaderForm.PreparePage(PageNo: Integer);
begin
  FPageNo := PageNo;            // remembered for the word-boundary handler
  PdfView.PageNumber := PageNo; // the view's word boxes track its displayed page
  FWords := PdfView.PageWordBoxes;
  // Concatenate Word.Text in order; offsets must still match the page text
  FPageText := BuildSpeechText(FWords);
  if Length(FWords) = 0 then
    HandleImageOnlyPage(PageNo); // a scan with no text layer
end;
The ordering comment is load-bearing: the viewer's PageWordBoxes tokenizes the text layer of the page the view currently displays, so navigate the view first, then extract — no rendering is required, only an open document. (The document component offers its own PageWordBoxes keyed to Pdf.PageNumber for headless use.) An empty result on a page that visibly carries content means an image-only scan — route it to OCR or skip it audibly ("page 4 contains no readable text") rather than letting the voice fall silent without explanation.
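The listing leaves BuildSpeechText undefined, and rule one applies to it as much as to anything else: if the helper joins words with a single space while the page text separates them with something longer, offsets drift from the second word onward. One way to keep them aligned is to pad each gap so every word starts at its own StartIndex. The sketch below is an assumption-heavy illustration: TWordStub is a hypothetical stand-in for the two TPdfWordBox fields it needs, StartIndex is assumed to be 0-based (the article does not say), and each Text is assumed to have exactly as many characters as the word occupies in the page text.

```pascal
program AlignedSpeechTextSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

type
  // Stand-in for the TPdfWordBox fields used here; not the library's declaration.
  TWordStub = record
    Text: string;
    StartIndex: Integer; // assumed 0-based offset into the page text
  end;

// Builds a string in which every word begins at its StartIndex,
// filling the gaps between words with spaces.
function BuildAlignedText(const Words: array of TWordStub): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Words) do
  begin
    // Pad until the next appended character lands at offset StartIndex.
    while Length(Result) < Words[I].StartIndex do
      Result := Result + ' ';
    Result := Result + Words[I].Text;
  end;
end;

var
  Words: array[0..1] of TWordStub;
  S: string;
begin
  // Two words separated by a two-character gap (for example a CR/LF pair).
  Words[0].Text := 'Hello'; Words[0].StartIndex := 0;
  Words[1].Text := 'world'; Words[1].StartIndex := 7;
  S := BuildAlignedText(Words);
  WriteLn(Length(S));                  // 12
  WriteLn(S[Words[1].StartIndex + 1]); // w (Pascal strings are 1-based)
end.
```

Padding with spaces keeps the synthesizer's pauses roughly where the page has them, but it is a design choice: if the extracted page text is available as a whole string, speaking that string directly is simpler and sidesteps the assumptions above.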
Wiring SAPI word boundaries to the tracker
TrackReadingWordAt, on the viewer, is the hinge of the whole feature: give it a page number and a character index, and it finds the word box containing that character, paints the reading cursor on it, and returns the word index — or −1. SAPI's word-boundary notification supplies exactly the character position you need:
procedure TReaderForm.OnSpeechWordBoundary(CharPos: Integer);
var
  WordIdx: Integer;
begin
  // CharPos is the character position from SAPI's word event,
  // not its audio stream position.
  // Maps the offset to a word box and moves the highlight in one call
  WordIdx := PdfView.TrackReadingWordAt(FPageNo, CharPos);
  if WordIdx < 0 then
    Exit; // boundary fell outside any word: keep last highlight
end;
Two defensive details. First, TrackReadingWordAt keeps its own word-box cache for the tracked page (rebuilt automatically when the page changes), so the per-boundary cost stays flat. Second, it does not clamp out-of-range input: an index at or beyond the page's character count returns −1 rather than snapping to the final word. Treat −1 as "hold the previous highlight," never as an error, because punctuation runs and inter-word whitespace legitimately produce boundaries that belong to no word. Logging every −1 will drown you; count them per page instead and investigate pages where the ratio spikes, which usually signals a text-normalization mismatch from rule one.
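Counting misses per page needs very little state. The sketch below is illustrative only: TPageStats, RecordBoundary, MissRatio, and the 0.25 threshold are hypothetical, and the sample values stand in for whatever the tracking call returned on each boundary.

```pascal
program MissRatioSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

type
  TPageStats = record
    Boundaries: Integer; // boundary events seen on this page
    Misses: Integer;     // events for which the tracker returned -1
  end;

// Tallies one boundary event; WordIdx is the tracker's return value.
procedure RecordBoundary(var Stats: TPageStats; WordIdx: Integer);
begin
  Inc(Stats.Boundaries);
  if WordIdx < 0 then
    Inc(Stats.Misses);
end;

// Fraction of boundaries that matched no word; 0 for a page with no events.
function MissRatio(const Stats: TPageStats): Double;
begin
  if Stats.Boundaries = 0 then
    Result := 0.0
  else
    Result := Stats.Misses / Stats.Boundaries;
end;

const
  SuspectRatio = 0.25; // hypothetical threshold; tune against known-good pages
  // Sample tracker results for one page: three of eight boundaries missed.
  Sample: array[0..7] of Integer = (0, 1, -1, 2, 3, -1, -1, 4);
var
  Stats: TPageStats;
  I: Integer;
begin
  Stats.Boundaries := 0;
  Stats.Misses := 0;
  for I := Low(Sample) to High(Sample) do
    RecordBoundary(Stats, Sample[I]);
  WriteLn('misses: ', Stats.Misses, ' of ', Stats.Boundaries);
  if MissRatio(Stats) > SuspectRatio then
    WriteLn('page looks suspect: compare speech text with page text');
end.
```

Reset the counters on every page turn and log one line per page; a healthy page and a drifting page then differ by a single number rather than by hundreds of log entries.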
The cursor itself: color, follow, and cleanup
SetReadingWord paints the highlight directly when you hold the word box yourself, ReadingWordColor styles it, and ReadingWordFollow := True scrolls the view just enough to keep the spoken word visible. That last property matters more than it sounds: a hand-rolled "center the current word" scroll makes the page lurch on every line break, and motion-sensitive readers will switch the feature off within a minute. The highlight renders only on the page currently shown in the active TPdfView, so multi-page reading must advance PageNumber in step with speech — and re-run the prepare step for the new page before its first boundary event arrives, so the speech text and offsets line up with the fresh page.
procedure TReaderForm.StopReading;
begin
  FVoice.Stop;               // halt SAPI playback first
  PdfView.ClearReadingWord;  // then remove the highlight; a stale cursor reads as a bug
end;
Symmetry matters on shutdown: every pause, stop, and page-turn path must end in ClearReadingWord. The single most-reported "bug" in our beta was an amber rectangle left behind on a paused page — harmless, but every tester filed it.
Speech rate stresses this pipeline more than document size does. At 300 words per minute, boundary events arrive on average every 200 ms (60,000 ms divided by 300 words); at the fastest SAPI rates they arrive faster than the eye comfortably tracks. Coalesce rather than queue: if a new boundary arrives while a highlight update is still pending, drop the stale one. A cursor that visits every word in order but half a second late feels broken; one that occasionally skips a word yet stays synchronized does not.
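Coalescing amounts to a one-slot mailbox: the speech callback overwrites the slot, and the paint step takes whatever is newest. The sketch below is a minimal single-threaded illustration with hypothetical names (TPendingBoundary, Post, Take); if your boundary callback runs on a different thread from the UI, the slot needs a lock around it.

```pascal
program CoalesceSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

type
  // Holds at most one pending character index; newer boundaries overwrite older ones.
  TPendingBoundary = record
    HasValue: Boolean;
    CharIndex: Integer;
    Dropped: Integer; // stale boundaries overwritten before they were painted
  end;

// Average gap between word boundaries at a given speech rate.
function BoundaryIntervalMs(WordsPerMinute: Integer): Double;
begin
  Result := 60000.0 / WordsPerMinute;
end;

// Called from the speech callback.
procedure Post(var P: TPendingBoundary; CharIndex: Integer);
begin
  if P.HasValue then
    Inc(P.Dropped); // the previous value was never painted; discard it
  P.CharIndex := CharIndex;
  P.HasValue := True;
end;

// Called when the UI is ready to paint. Returns False if nothing is pending.
function Take(var P: TPendingBoundary; out CharIndex: Integer): Boolean;
begin
  Result := P.HasValue;
  CharIndex := P.CharIndex;
  P.HasValue := False;
end;

var
  P: TPendingBoundary;
  Idx: Integer;
begin
  P.HasValue := False;
  P.CharIndex := 0;
  P.Dropped := 0;
  WriteLn(BoundaryIntervalMs(300):0:0, ' ms between boundaries at 300 wpm');
  // Three boundaries arrive before the UI gets a chance to paint.
  Post(P, 10);
  Post(P, 16);
  Post(P, 21);
  if Take(P, Idx) then
    WriteLn('paint word at ', Idx, ', dropped ', P.Dropped);
end.
```

Only the index taken from the slot is passed on to the tracking call, so the highlight jumps straight to the word being spoken instead of replaying the two it missed.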
Edge cases that separate demos from products
Three categories recur. Combining characters: Unicode sequences such as base letters with combining diacritics may occupy more character indices than the visual word suggests, so offset arithmetic that assumes one index per visible glyph drifts. That is one more reason to let TrackReadingWordAt do the mapping instead of computing word numbers yourself. Hyphenation: a word broken across a line break becomes two boxes; if you speak it as one token, the single boundary event it produces resolves to the first box and the second half is never highlighted. That is acceptable, but decide it on purpose. And tagged versus untagged documents: word sequencing follows the document's logical structure when proper tagging exists (the territory of ISO 14289, PDF/UA) and falls back to layout heuristics otherwise, so a two-column untagged page may read straight across both columns. Rotated pages add a fourth: each word's Rect still bounds it correctly in page space, but a viewport-follow policy tuned for horizontal flow scrolls jarringly when text runs vertically, so keep at least one rotated document in the regression set. For reading-order handling, sentence-level units via ReadingUnits, and the wider assistive stack, see building an accessible PDF reader in Delphi.
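The combining-character drift is easy to reproduce. The sketch below compares the number of character indices in a string with a rough count of what a reader sees; VisibleCount is a hypothetical helper that skips only the Combining Diacritical Marks block (U+0300 to U+036F), so it is a demonstration of the mismatch rather than a real grapheme counter.

```pascal
program CombiningLengthSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}

// Counts UTF-16 code units that are not in the Combining Diacritical Marks
// block. Ignores surrogate pairs and other combining ranges.
function VisibleCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $0300) or (Ord(S[I]) > $036F) then
      Inc(Result);
end;

var
  S: UnicodeString;
begin
  // 'cafe' followed by U+0301 COMBINING ACUTE ACCENT renders as four glyphs.
  S := 'cafe';
  S := S + WideChar($0301);
  WriteLn(Length(S));       // 5 character indices
  WriteLn(VisibleCount(S)); // 4 visible characters
end.
```

Any arithmetic that converts "the fifth visible letter" into a character index is off by one from this word onward, which is exactly the kind of drift the tracker's index-based lookup avoids.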
One platform note: SAPI is Windows-only. The word-box and tracking API is identical under Lazarus/FPC, but Linux and macOS builds need a different synthesizer behind the same boundary events — the setup differences are covered in running the viewer under Lazarus and FPC. Highlight rendering cost also interacts with your page cache at high speech rates; the budget arithmetic in render caching and zoom performance applies unchanged here.
Frequently asked questions
Why does TrackReadingWordAt always return −1?
Usually one of three causes: the page number passed in is out of range or the document is not active, the text given to the TTS engine differs from the extracted page text so offsets do not line up, or the character index belongs to whitespace between words. Check them in that order.
Why does the highlight stop updating after a page turn?
The reading cursor draws only on the active view's current page. Advance PageNumber and re-fetch PageWordBoxes for the speech text before resuming, so the boundary offsets refer to the page now on screen.
Can I highlight whole sentences instead of single words?
Yes — ReadingUnits returns sentence- and block-level units with their own highlight rectangles (paint them with SetReadingHighlight), which suits slower listeners and reduces visual churn at high speech rates.
Version requirements (v1.53 or later for word boxes, v1.56 for the tracking cursor), the full reading API, and a working read-aloud demo are on the product page: PDFium Component.