Technical article

PDF/UA Structure Tree Validation in Delphi with PDFium

Your preflight reports the file as PDF/UA clean. veraPDF opens the same file and flags a Figure with no alternate text under clause 7.3. Both tools are right, and the gap between them is the whole problem with checking accessibility by scanning bytes. A byte-level pass confirms the file says it is tagged: it finds the /StructTreeRoot, the /MarkInfo /Marked true, the pdfuaid:part in the XMP packet, the document title, the language. Those are format markers, and they are necessary. They tell you nothing about whether the actual figure on page four carries a description a screen reader can read aloud. That answer lives in the tag tree, and to get it you have to walk the tree.

PDFium Component is a native VCL PDF library for Delphi and C++Builder, and its ValidatePdfUa does both passes. The byte-level pass handles the format markers. On top of it sits a structure-tree pass that loads the live tagged tree, walks every element, and checks the small set of high-confidence content rules where a missing attribute means a real accessibility defect rather than a stylistic preference. This article is about that second pass: what it checks, why the rule logic is a pure function with no DLL underneath it, and where it deliberately stops short.

Why a byte scan cannot see a missing Alt

ISO 14289-1 (PDF/UA-1) is a layer of requirements on top of ISO 32000. Some of those requirements are structural and visible in the raw file: the catalog must declare a structure tree, the viewer preferences must set DisplayDocTitle, fonts must be embedded. A token scanner that strips stream bodies and matches name tokens with delimiter boundaries can verify all of those, and the component's ValidatePdfUaCompliance does exactly that for clauses like 7.1, 7.18, and 7.21.

But "every Figure has alternate text" is not a property of the file's syntax. It is a property of the logical structure: the tree of tagged elements that maps content to meaning. A Figure's text alternative can sit in the structure element dictionary as /Alt, be supplied through /ActualText, or belong to an element whose custom type is role-mapped to Figure. You cannot reliably find it by grepping for /Alt in the byte stream, because that string appears in unrelated contexts, may be compressed inside an object stream, and tells you nothing about which structure element it belongs to. The honest way to answer the question is to ask the document's own structure tree, element by element, which is the same surface veraPDF and PAC evaluate. That is the line the component's Tier-1 checks (the structure-tree rules this article covers) are built around: byte scan for format, tree walk for content.
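To make the grep problem concrete, here is a minimal sketch in Delphi-compatible Pascal. The two-object excerpt is invented for illustration and is uncompressed, which a real file need not be; FileMentionsAlt is a hypothetical helper, not part of the component.

```pascal
program AltGrepDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  SysUtils;

const
  // Hypothetical excerpt: object 12 is a Figure with no /Alt,
  // object 13 is a Span that happens to carry one.
  RawPdf =
    '12 0 obj << /Type /StructElem /S /Figure /P 10 0 R /K 0 >> endobj'#10 +
    '13 0 obj << /Type /StructElem /S /Span /P 10 0 R /Alt (Logo) /K 1 >> endobj'#10;

// Naive byte-level test: does the token /Alt appear anywhere in the file?
function FileMentionsAlt(const Bytes: string): Boolean;
begin
  Result := Pos('/Alt', Bytes) > 0;
end;

begin
  // Prints TRUE although the only Figure has no alternate text:
  // the match belongs to the Span, and the scan cannot tell the difference.
  Writeln('Byte scan found /Alt: ', FileMentionsAlt(RawPdf));
end.
```

The scan answers "does some /Alt exist", which is a different question from "does this Figure have one". Only a walk that keeps each attribute attached to its element can answer the second.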

Reading the live tag tree

The raw material is TPdf.GetStructureElements (also surfaced as the StructureElements property), which returns a TPdfStructureElements — a flat array of TPdfStructureElement records in document order. Each record is the projection of one structure element through PDFium's accessor functions, with the fields the accessibility rules actually need:

type
  TPdfStructureElement = record
    Level: Integer;            // depth in the tag tree
    ParentIndex: Integer;      // index of parent element, or -1
    TypeName: WString;         // standard /S name: Figure, Formula, Note...
    Title: WString;            // /T
    AlternateText: WString;    // /Alt   (FPDF_StructElement_GetAltText)
    ActualText: WString;       // /ActualText
    Expansion: WString;        // /E
    ID: WString;               // /ID    (FPDF_StructElement_GetID)
    Language: WString;         // /Lang
    MarkedContentIDs: TPdfIntegerArray;
    // ... child bookkeeping fields
  end;

The TypeName field is the one the validator pivots on. It comes from FPDF_StructElement_GetType, which returns the element's structure type (its /S name) as PDFium reports it; the validator applies no role mapping of its own. AlternateText comes from FPDF_StructElement_GetAltText, ActualText from FPDF_StructElement_GetActualText, and ID from FPDF_StructElement_GetID. Because the array is flat and ordered, the validator can reason about the whole document in one pass instead of recursing, which matters for the one rule that is global rather than per-element.
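A flat, ordered array with Level and ParentIndex is enough to recover the tree shape without recursion. The sketch below uses a cut-down stand-in record (TElem, with string instead of WString) and an invented four-element tree; Elem, SampleTree, and PathTo are hypothetical helpers for illustration only.

```pascal
program FlatTreeWalk;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  SysUtils;

type
  // Stand-in with only the fields this sketch needs.
  TElem = record
    Level: Integer;
    ParentIndex: Integer;
    TypeName: string;
  end;
  TElems = array of TElem;

function Elem(ALevel, AParent: Integer; const AType: string): TElem;
begin
  Result.Level := ALevel;
  Result.ParentIndex := AParent;
  Result.TypeName := AType;
end;

// Invented sample: Document > Sect > Figure, plus a Note under Document.
function SampleTree: TElems;
begin
  SetLength(Result, 4);
  Result[0] := Elem(0, -1, 'Document');
  Result[1] := Elem(1, 0, 'Sect');
  Result[2] := Elem(2, 1, 'Figure');
  Result[3] := Elem(1, 0, 'Note');
end;

// Follow ParentIndex links upward to build a path such as Document/Sect/Figure.
function PathTo(const Elems: TElems; Index: Integer): string;
begin
  Result := '';
  while Index >= 0 do
  begin
    if Result = '' then
      Result := Elems[Index].TypeName
    else
      Result := Elems[Index].TypeName + '/' + Result;
    Index := Elems[Index].ParentIndex;
  end;
end;

var
  Tree: TElems;
  I: Integer;
begin
  Tree := SampleTree;
  // One linear pass, indenting by Level.
  for I := 0 to High(Tree) do
    Writeln(StringOfChar(' ', 2 * Tree[I].Level), Tree[I].TypeName,
      '   path: ', PathTo(Tree, I));
end.
```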

The checker is a pure function, and that is on purpose

The rule logic does not live inside the method that talks to the DLL. It is a standalone, public, pure function:

function ValidatePdfUaStructureElements(
  const Elements: TPdfStructureElements): TPdfUaValidationIssues;

It takes a flat element array and returns a set of issues. It calls no PDFium function, opens no document, touches no global state. That separation is deliberate, and it pays off twice. First, testability: you can build a synthetic TPdfStructureElements array in a unit test — a Figure with no Alt, a Formula whose only accessible text is in ActualText, two Notes that share an ID — and assert on the result set without pdfium.dll present at all. The rule logic is verified offline; the DLL traversal is verified separately by a live-document smoke test that skips when the library is missing.

Second, clarity of responsibility. TPdf.ValidatePdfUa owns the messy part — loading each page, pulling its elements, accumulating them — and then hands a clean array to the pure checker. "Get the data" (DLL, side effects, lifetime) and "judge the rules" (pure, deterministic) never tangle. When a rule needs to change, you change a function that has no I/O in it.
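The offline test this separation allows might look like the sketch below. Everything in it is a stand-in so that it compiles without the component: TElem, TIssue, and CheckElements are hypothetical names, and the checker body encodes the three rules as this article describes them rather than the component's actual source.

```pascal
program PureCheckerTest;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$ASSERTIONS ON}

uses
  SysUtils;

type
  // Stand-ins for this sketch only; string replaces WString.
  TElem = record
    TypeName, AlternateText, ActualText, ID: string;
  end;
  TElems = array of TElem;
  TIssue = (isFigureMissingAlt, isFormulaMissingAlt, isNoteMissingId);
  TIssues = set of TIssue;

function Elem(const AType, AAlt, AActual, AId: string): TElem;
begin
  Result.TypeName := AType;
  Result.AlternateText := AAlt;
  Result.ActualText := AActual;
  Result.ID := AId;
end;

function Single(const A: TElem): TElems;
begin
  SetLength(Result, 1);
  Result[0] := A;
end;

function Pair(const A, B: TElem): TElems;
begin
  SetLength(Result, 2);
  Result[0] := A;
  Result[1] := B;
end;

// Stand-in checker: pure, no DLL, no I/O.
function CheckElements(const Elems: TElems): TIssues;
var
  I, J: Integer;
  NoText: Boolean;
begin
  Result := [];
  for I := 0 to High(Elems) do
  begin
    NoText := (Elems[I].AlternateText = '') and (Elems[I].ActualText = '');
    if (Elems[I].TypeName = 'Figure') and NoText then
      Include(Result, isFigureMissingAlt)
    else if (Elems[I].TypeName = 'Formula') and NoText then
      Include(Result, isFormulaMissingAlt)
    else if Elems[I].TypeName = 'Note' then
    begin
      if Elems[I].ID = '' then
        Include(Result, isNoteMissingId);
      for J := 0 to I - 1 do
        if (Elems[J].TypeName = 'Note') and (Elems[J].ID = Elems[I].ID) then
          Include(Result, isNoteMissingId);
    end;
  end;
end;

begin
  // A Figure with neither Alt nor ActualText is flagged.
  Assert(CheckElements(Single(Elem('Figure', '', '', ''))) = [isFigureMissingAlt]);
  // A Formula whose only accessible text is ActualText passes.
  Assert(CheckElements(Single(Elem('Formula', '', 'x squared', ''))) = []);
  // Two Notes sharing an ID are flagged.
  Assert(CheckElements(Pair(Elem('Note', '', '', 'n1'),
    Elem('Note', '', '', 'n1'))) = [isNoteMissingId]);
  Writeln('all synthetic cases passed');
end.
```

No document is opened and no library is loaded, so a test like this can run on a build agent that has never seen pdfium.dll.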

What the three rules actually check

The structure-tree pass raises three issue values: pvuaiFigureMissingAlt, pvuaiFormulaMissingAlt, and pvuaiNoteMissingId. They are appended to the end of the enumeration behind TPdfUaValidationIssues, so existing members keep their ordinals and the type stays ABI-stable for existing callers. The body is small enough to reason about completely:

for I := 0 to High(Elements) do
begin
  T := string(Elements[I].TypeName);
  if T = 'Figure' then
  begin
    // §7.3 — a Figure needs an alternate representation:
    // an Alt entry OR ActualText. Flag only when BOTH are empty.
    if (Elements[I].AlternateText = '') and (Elements[I].ActualText = '') then
      Include(Result, pvuaiFigureMissingAlt);
  end
  else if T = 'Formula' then
  begin
    // §7.7 — same rule as Figure: Alt OR ActualText.
    if (Elements[I].AlternateText = '') and (Elements[I].ActualText = '') then
      Include(Result, pvuaiFormulaMissingAlt);
  end
  else if T = 'Note' then
  begin
    // §7.9 — every Note must have a unique ID.
    NoteId := string(Elements[I].ID);
    if NoteId = '' then
      Include(Result, pvuaiNoteMissingId)
    else
      for J := 0 to I - 1 do
        if (string(Elements[J].TypeName) = 'Note') and
           (string(Elements[J].ID) = NoteId) then
        begin
          Include(Result, pvuaiNoteMissingId);
          Break;
        end;
  end;
end;

Clause 7.3 governs figures: a Figure element must provide a text alternative. The early version of this check only looked at the Alt entry, which made it stricter than the reference validators. PDF/UA accepts a figure whose accessible text is supplied through ActualText instead — replacement text is a valid alternative representation — so the rule flags a Figure only when both Alt and ActualText are empty. Clause 7.7 covers formulas, and after the same correction it uses the identical Alt-or-ActualText test; a conformance-corpus sample that gave a Formula its accessible text through ActualText alone was being falsely rejected until the Formula branch was brought in line with the Figure branch.

Clause 7.9 is different in kind. A Note must have an /ID, and that ID must be unique across the document. A missing ID is a per-element failure. A duplicate ID is a relationship between two elements, which is why the flat array matters: for each Note, the checker scans the elements that precede it and flags a collision with any earlier Note carrying the same ID. Both failures are reported under the same pvuaiNoteMissingId value. The inner loop visits every earlier element, not only earlier Notes, so the worst-case cost is the number of Notes times the number of elements. That is irrelevant for any real document, and it keeps the function a single readable loop with no auxiliary index to keep in sync.
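For comparison, this is roughly what the auxiliary index would look like if profiling ever showed the nested scan to matter. It is a sketch of the alternative, not what the component does. HasDuplicateId is a hypothetical name, and the unit is spelled Generics.Collections here (recent Delphi versions also accept System.Generics.Collections).

```pascal
program NoteIdIndex;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

uses
  SysUtils, Generics.Collections;

// Returns True if any non-empty ID occurs more than once.
// Keys are compared exactly, so IDs that differ only in case stay distinct.
function HasDuplicateId(const Ids: array of string): Boolean;
var
  Seen: TDictionary<string, Boolean>;
  I: Integer;
begin
  Result := False;
  Seen := TDictionary<string, Boolean>.Create;
  try
    for I := 0 to High(Ids) do
    begin
      if Ids[I] = '' then
        Continue;               // a missing ID is a separate failure
      if Seen.ContainsKey(Ids[I]) then
      begin
        Result := True;
        Exit;                   // the finally block still frees Seen
      end;
      Seen.Add(Ids[I], True);
    end;
  finally
    Seen.Free;
  end;
end;

begin
  Writeln(HasDuplicateId(['n1', 'n2', 'n1']));  // TRUE
  Writeln(HasDuplicateId(['n1', 'n2']));        // FALSE
end.
```

The price is an extra object with its own lifetime and a second place where the notion of "same ID" is defined, which is the bookkeeping the single-loop version avoids.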

Accumulating across pages so uniqueness is global

PDFium exposes structure elements per page, not per document, so the orchestration in ValidatePdfUa has to gather them before the rules run. It walks each page with FPDF_LoadPage / GetStructureElementsForPage / FPDF_ClosePage, independent of whatever page the component currently has open, and appends every page's elements into one array. Only then does it call the pure checker:

// inside TPdf.ValidatePdfUa, after the byte-level pass
if (FDocument <> nil) and
   (not (pvuaiMissingStructTreeRoot in Result.Issues)) then
begin
  AllElems := nil;
  PageTotal := FPDF_GetPageCount(FDocument);
  for I := 0 to PageTotal - 1 do
  begin
    Page := FPDF_LoadPage(FDocument, I);
    if Page = nil then Continue;
    try
      PageElems := GetStructureElementsForPage(Page);
    finally
      FPDF_ClosePage(Page);
    end;
    // append PageElems into AllElems ...
  end;
  Result.Issues := Result.Issues + ValidatePdfUaStructureElements(AllElems);
end;

The accumulation is what makes the 7.9 uniqueness check correct. Two Notes on different pages can share an ID; if you validated page by page you would never see the collision, because each page's element set looks internally consistent. Building one document-wide array is the only way the duplicate becomes visible. The guard on the front is worth noting too: the tree walk runs only when the byte-level pass did not report pvuaiMissingStructTreeRoot. An untagged document has no tree to walk and has already been flagged for the missing structure root, so the per-page loads are skipped entirely. The deep pass costs nothing on the documents that cannot benefit from it.
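The append step itself is not shown in the excerpt, so the following is only one way it could be written. It assumes, without confirmation from the component's source, that each page's ParentIndex values are relative to that page's own array and therefore need shifting when the arrays are concatenated. TElem, Elem, AppendElems, and BuildTwoPages are hypothetical names.

```pascal
program AppendPages;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  // Stand-in record with only the fields this sketch touches.
  TElem = record
    ParentIndex: Integer;
    TypeName, ID: string;
  end;
  TElems = array of TElem;

function Elem(AParent: Integer; const AType, AId: string): TElem;
begin
  Result.ParentIndex := AParent;
  Result.TypeName := AType;
  Result.ID := AId;
end;

// Append Page to All, rebasing parent links (assumed page-relative)
// so they still point at the right element in the combined array.
procedure AppendElems(var All: TElems; const Page: TElems);
var
  Base, I: Integer;
begin
  Base := Length(All);
  SetLength(All, Base + Length(Page));
  for I := 0 to High(Page) do
  begin
    All[Base + I] := Page[I];
    if Page[I].ParentIndex >= 0 then
      All[Base + I].ParentIndex := Page[I].ParentIndex + Base;
  end;
end;

// Two invented pages, each with a Sect and a Note that reuses the ID n1.
function BuildTwoPages: TElems;
var
  Page: TElems;
begin
  Result := nil;
  SetLength(Page, 2);
  Page[0] := Elem(-1, 'Sect', '');
  Page[1] := Elem(0, 'Note', 'n1');
  AppendElems(Result, Page);   // page 1
  AppendElems(Result, Page);   // page 2
end;

var
  All: TElems;
begin
  All := BuildTwoPages;
  Writeln('elements: ', Length(All));                       // 4
  Writeln('parent of second Note: ', All[3].ParentIndex);   // 2
  Writeln('IDs collide: ', All[1].ID = All[3].ID);          // TRUE
end.
```

Each page on its own holds exactly one Note with ID n1 and looks consistent. The collision exists only in the combined array.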

Conservative by design: miss quietly, never cry wolf

The single most important property of this validator is what it refuses to do. It matches only the standard /S type names that FPDF_StructElement_GetType returns directly: Figure, Formula, Note. A document that defines a custom type and role-maps it to Figure will, depending on how PDFium resolves the type, report its own name. When that happens the checker does not recognize it and stays silent. That is a false negative, and it is the intended behavior. The design rule is to under-report rather than ever produce a false positive, because a preflight tool that cries wolf on conformant files trains its users to ignore it, and an ignored validator is worse than none. Decorative images are marked as artifacts in the page content and sit outside the structure tree, so they never surface as Figures to begin with; you will not get a "missing Alt" complaint about a decorative rule or background image that is correctly marked as an artifact.

This is also why the scope is held to three rules. Heading-level nesting (clause 7.4), table header scope (7.5), and role-map cycle detection (7.1) are all legitimate PDF/UA requirements, but checking them well needs real graph and attribute analysis, and checking them naively produces exactly the false positives the design forbids — PDF/UA permits heading patterns like H1, H2, H3, H3 that a simple "must strictly increase" rule would wrongly reject. Those checks are left to dedicated conformance tools. The Tier-1 set is the subset where a missing attribute is unambiguous.
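The heading example can be made concrete with a small sketch. Both functions are hypothetical and operate on a bare list of heading levels. The second is a looser rule in the spirit of clause 7.4 (no level skipped when descending) and is still far from a complete check, since it knows nothing about whether the levels match the document's semantics.

```pascal
program HeadingRules;
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Naive rule: every heading level must be greater than the previous one.
function StrictlyIncreasing(const Levels: array of Integer): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to High(Levels) do
    if Levels[I] <= Levels[I - 1] then
    begin
      Result := False;
      Exit;
    end;
end;

// Looser rule: a level may repeat or return to a shallower one, but may not
// go more than one level deeper in a single step (H1 then H3 skips H2).
function NoSkippedLevels(const Levels: array of Integer): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to High(Levels) do
    if Levels[I] > Levels[I - 1] + 1 then
    begin
      Result := False;
      Exit;
    end;
end;

begin
  // H1, H2, H3, H3: two sibling subsections under the same H2.
  Writeln(StrictlyIncreasing([1, 2, 3, 3]));  // FALSE, a false positive
  Writeln(NoSkippedLevels([1, 2, 3, 3]));     // TRUE
  // H1, H3: a level is skipped.
  Writeln(NoSkippedLevels([1, 3]));           // FALSE
end.
```

The naive rule rejects a perfectly ordinary outline, which is exactly the kind of alarm the Tier-1 scope is designed to avoid.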

The boundary, stated plainly

Two limits are worth knowing before you wire this into a release gate. First, the checker is only as good as what PDFium can read from the structure element. A handful of conformance-corpus files that the reference validators pass use an alternate-text mechanism PDFium does not surface, so FPDF_StructElement_GetAltText returns empty even though the file is genuinely conformant. The pure checker then "correctly" flags a missing Alt on incomplete data — a false positive that originates in the DLL's accessor coverage, not the rule logic. Loosening the rule to absorb those cases would also blind it to the real failures it is meant to catch, so they are documented as a known PDFium limitation rather than papered over.

Second, this is a preflight, not a certification. Tier-1 catches the high-confidence content errors that a byte scan structurally cannot, and its rules are chosen so that the rule logic itself raises no false alarms; the accessor gap described above is the one known exception. Full PDF/UA conformance, including heading semantics, table structure, and reading-order correctness, still belongs to a complete validator and ultimately to a human reviewer. Use ValidatePdfUa to fail the obvious defects fast and cheaply in your own pipeline, then let veraPDF or PAC have the final word. The same structure-tree traversal underpins building an accessible PDF reader in Delphi, where the tag tree drives reading order and spoken text, and it complements the metadata-level work in reviewing PDF annotations from Delphi.

The structure-tree APIs and the ValidatePdfUa validator shown here ship with the PDFium Component for Delphi and C++Builder (VCL) and Lazarus/FPC (LCL). The product page links the full API reference, including the complete TPdfStructureElement record layout and the issue enumeration behind these checks.