Technical Article

PDF/A and PDF/UA Preflight in Delphi with PDFlibPas

PDF/A and PDF/UA answer two questions that have nothing to do with each other, and treating them as one accessibility-and-archiving checkbox is how broken files reach an archive wearing a compliance label. PDF/A asks whether a file will still render faithfully in twenty years. PDF/UA asks whether assistive technology can read it today. A document can pass one cold and fail the other, so the only honest verdict comes from running both, and from running them before the file is written out, not after a downstream system trusts the conformance identifier baked into its metadata. That identifier is a self-declaration. Nothing in the format requires it to be true, and an application that writes "PDF/A-1b" into the XMP without validating against the standard produces a file that looks compliant to every consumer that reads only the label. losLab PDF Library (PDFlibPas) closes that gap for Delphi and C++Builder by building both validators into the library, so the check runs in-process with no external service to stand up.

Two standards that fail files for opposite reasons

ISO 19005 (PDF/A) is a reproduction contract. A conforming file has to render identically decades from now on software that never saw the system that produced it, so the rules attack external dependencies: every font embedded, color anchored to an embedded ICC OutputIntent or expressed in a device-independent space, no encryption in any part, no JavaScript, XMP metadata that agrees with the document information dictionary. ISO 14289 (PDF/UA) is a semantics contract instead. Assistive technology has to traverse the document and come away with meaning, which lives in a wholly different layer: a complete structure tree, alternate text on figures, a document title set for display, heading levels that do not skip, table header relationships that survive once the page is off screen.

Because the two standards police different layers, the files that bite you are the ones that sit between them. An archive-perfect document can be silent to a screen reader. A beautifully tagged one can reference a desktop font that will not exist in ten years. Public-sector publishing is the usual place both requirements land at once, and a pipeline there cannot collapse them into a single gate. The findings go to different people. Unembedded fonts are a defect in the code that generates the PDF, while missing alternate text belongs to whoever owns the content templates, and a report that mixes the two just gets forwarded twice.

Which part of PDF/A you target matters as much as whether you hit it. PDF/A-1 is frozen on PDF 1.4 and rejects transparency and JPEG2000, both of which modern reporting output reaches for without thinking. PDF/A-2 (ISO 19005-2, built on ISO 32000-1) accepts both and is the sane default for a new archive. PDF/A-3 goes further and permits embedded files of any type, which is what regulated e-invoicing formats rely on. A team still standardizing on PDF/A-1b in 2026 is usually carrying a requirement someone wrote fifteen years ago, and renegotiating the target part is often cheaper than stripping transparency out of every chart the system emits.
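
The part-selection rule in the paragraph above is small enough to write down as code. The sketch below is only that rule, not a complete feature matrix: the three Boolean flags and the function name are hypothetical, and a real decision would also weigh whatever the archive's acceptance criteria say.

```pascal
program PickPdfAPart;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

// Returns the lowest PDF/A part (1, 2 or 3) whose rules admit the listed
// features. Only the features named in the article are considered.
function LowestPdfAPart(UsesTransparency, UsesJpeg2000,
  EmbedsArbitraryFiles: Boolean): Integer;
begin
  if EmbedsArbitraryFiles then
    Result := 3        // only PDF/A-3 permits embedded files of any type
  else if UsesTransparency or UsesJpeg2000 then
    Result := 2        // PDF/A-1 rejects both
  else
    Result := 1;
end;

begin
  Writeln('Charts with transparency: PDF/A-', LowestPdfAPart(True, False, False));
  Writeln('Invoice with XML attachment: PDF/A-', LowestPdfAPart(False, False, True));
end.
```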

Structured findings at ingestion time

The flat-API entry point is CheckFileCompliance, with the test selector 1 for PDF/A and 2 for PDF/UA. It hands back a string-list handle whose items are individual findings, one per line, which is exactly the shape an automated gate wants to walk:

function GateArchiveUpload(Pdf: TPDFlib; const FileName: string): Boolean;
var
  ListId, I: Integer;
begin
  ListId := Pdf.CheckFileCompliance(FileName, '', 1, 0);  // 1 = PDF/A
  if ListId = 0 then
  begin
    // 0 means "no findings" OR "file unreadable": disambiguate before passing
    Result := Pdf.LastErrorCode = 0;
    Exit;
  end;
  Result := False;  // a non-zero handle always carries at least one finding
  try
    // LogFinding is the application's own logging routine
    for I := 0 to Pdf.GetStringListCount(ListId) - 1 do
      LogFinding(FileName, Pdf.GetStringListItem(ListId, I));
  finally
    Pdf.ReleaseStringList(ListId);  // release the handle even if logging raises
  end;
end;

Two details decide whether this runs unattended. The first is a return value that means two opposite things. CheckFileCompliance returns 0 when the file is fully compliant and also when the file could not be opened at all, because internally an empty result list collapses to 0 in both cases. A gate that reads 0 as a pass will wave corrupt uploads straight into the archive, so disambiguate with LastErrorCode before you trust the zero, as the gate above does. The second concerns where the file is in its lifecycle. The checker runs on the library's streaming reader instead of the full document model, opening the file directly with read sharing and never calling LoadFromFile, which is why it can chew through multi-gigabyte input without building an object tree. That same streaming open fails while another process still holds the file for writing, and an upload in progress is precisely that state. Gate after the transfer finishes.
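
Both details can be made explicit in a few lines of plain RTL code. What follows is a minimal sketch and not part of PDFlibPas: TGateVerdict, ClassifyCheck, and IsFileQuiescent are hypothetical names. ClassifyCheck turns the pair (return value, LastErrorCode) into three outcomes so that an unreadable file is never filed under either "pass" or "fail". IsFileQuiescent probes for a writer by opening the file with deny-write sharing, which on Windows fails while another process holds a write handle; on other platforms share modes are advisory, so treat the probe as a hint there.

```pascal
program GatePrecheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

type
  TGateVerdict = (gvPass, gvFail, gvUnreadable);

// ListId: the value CheckFileCompliance returned.
// ErrorCode: LastErrorCode read immediately afterwards.
function ClassifyCheck(ListId, ErrorCode: Integer): TGateVerdict;
begin
  if ListId <> 0 then
    Result := gvFail          // a findings list exists
  else if ErrorCode <> 0 then
    Result := gvUnreadable    // empty result caused by a failed open or parse
  else
    Result := gvPass;         // empty result and no error
end;

// True when the file exists and no other process holds it open for writing.
// A missing file also yields False.
function IsFileQuiescent(const FileName: string): Boolean;
var
  Probe: TFileStream;
begin
  Result := False;
  try
    Probe := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
    try
      Result := True;
    finally
      Probe.Free;
    end;
  except
    on EStreamError do        // EFOpenError descends from EStreamError
      Result := False;
  end;
end;

begin
  if ParamCount >= 1 then
    if IsFileQuiescent(ParamStr(1)) then
      Writeln('ready to gate: ', ParamStr(1))
    else
      Writeln('still being written, or missing: ', ParamStr(1));
end.
```

A gate built this way would send gvUnreadable files to quarantine for a retry and reserve gvFail for files that were actually analyzed.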

The streaming design pays off again under load. Each check opens its input read-only and shares it for reading, so a corpus audit scales out across worker threads or processes with one TPDFlib instance per worker and no contention between them. The resource that needs discipline is the handle itself. Every non-zero result from CheckFileCompliance stays allocated until you call ReleaseStringList, and a long-running gate that forgets to release them does not crash, it just bleeds memory slowly until someone goes looking for why.
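
One way to arrange the one-instance-per-worker rule is a TThread descendant that creates its own TPDFlib inside Execute and releases every non-zero handle in a finally block. This is a sketch under assumptions: the unit name PDFlibPas in the uses clause is a placeholder for whichever unit declares TPDFlib, TAuditWorker is a hypothetical class, and the CheckFileCompliance call simply repeats the four-argument form used in the gate above.

```pascal
unit AuditWorker;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

interface

uses
  Classes,
  PDFlibPas;  // ASSUMPTION: placeholder name for the unit that declares TPDFlib

type
  // One worker audits its own slice of the corpus with its own TPDFlib.
  TAuditWorker = class(TThread)
  private
    FFiles: TStringList;
    FFailed: Integer;
    FUnreadable: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(AFiles: TStrings);
    destructor Destroy; override;
    property Failed: Integer read FFailed;
    property Unreadable: Integer read FUnreadable;
  end;

implementation

constructor TAuditWorker.Create(AFiles: TStrings);
begin
  inherited Create(True);      // created suspended; the caller calls Start
  FFiles := TStringList.Create;
  FFiles.Assign(AFiles);       // private copy, nothing shared across threads
end;

destructor TAuditWorker.Destroy;
begin
  inherited Destroy;           // waits for Execute to finish first
  FFiles.Free;
end;

procedure TAuditWorker.Execute;
var
  Pdf: TPDFlib;
  I, ListId: Integer;
begin
  Pdf := TPDFlib.Create;       // owned by this thread only
  try
    for I := 0 to FFiles.Count - 1 do
    begin
      if Terminated then
        Break;
      ListId := Pdf.CheckFileCompliance(FFiles[I], '', 1, 0);  // 1 = PDF/A
      if ListId <> 0 then
      begin
        try
          Inc(FFailed);        // findings exist; read them here if needed
        finally
          Pdf.ReleaseStringList(ListId);  // never leave a handle allocated
        end;
      end
      else if Pdf.LastErrorCode <> 0 then
        Inc(FUnreadable);      // 0 plus an error code: could not be opened
    end;
  finally
    Pdf.Free;
  end;
end;

end.
```

The caller splits the file list into slices, creates one worker per slice, calls Start on each, then WaitFor, and sums Failed and Unreadable before freeing the workers.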

Reports for humans, diffs for build gates

A finding list is the right shape for a gate and the wrong shape for an email to the template team. CreatePreflightReport renders the same analysis as readable prose, CreatePreflightReportEx adds a report-format selector, and SavePreflightReport writes it to disk so the report can travel inside the delivered document package. Plenty of archival contracts make that report a deliverable in its own right, not just an internal artifact.

The member of this family that earns its place quietly is ComparePreflightReports. Compliance is a regression surface like any other piece of behavior. A template tweak, a newly licensed corporate font, or a library upgrade can each introduce a finding that was not there last release, and none of them announces itself. Keep golden reports for a set of representative documents under version control, regenerate them after every change, and run ComparePreflightReports to compute the delta. An empty diff is a release artifact worth keeping. A surprise finding fails the build, which is a far cheaper place to discover it than the audit.
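
The signature of ComparePreflightReports is not shown here, so the sketch below does not call it. It illustrates the same golden-file idea with a swapped-in technique: a plain line-level set difference between a stored finding list and a freshly generated one, each saved one finding per line. CollectNewFindings and the two command-line file arguments are hypothetical, and the comparison assumes finding text is stable between runs.

```pascal
program FindingDelta;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

// Adds to NewFindings every non-blank line of Current that is absent from Golden.
procedure CollectNewFindings(Golden, Current, NewFindings: TStrings);
var
  Known: TStringList;
  I: Integer;
begin
  Known := TStringList.Create;
  try
    Known.CaseSensitive := True;   // findings must match exactly
    Known.Assign(Golden);
    Known.Sorted := True;          // sorts the list; IndexOf becomes a binary search
    for I := 0 to Current.Count - 1 do
      if (Trim(Current[I]) <> '') and (Known.IndexOf(Current[I]) < 0) then
        NewFindings.Add(Current[I]);
  finally
    Known.Free;
  end;
end;

var
  Golden, Current, Fresh: TStringList;
  I: Integer;
begin
  // Usage: FindingDelta golden.txt current.txt
  Golden := nil;
  Current := nil;
  Fresh := nil;
  try
    Golden := TStringList.Create;
    Current := TStringList.Create;
    Fresh := TStringList.Create;
    Golden.LoadFromFile(ParamStr(1));
    Current.LoadFromFile(ParamStr(2));
    CollectNewFindings(Golden, Current, Fresh);
    for I := 0 to Fresh.Count - 1 do
      Writeln('NEW: ', Fresh[I]);
    if Fresh.Count > 0 then
      ExitCode := 1;               // a surprise finding fails the build
  finally
    Fresh.Free;
    Current.Free;
    Golden.Free;
  end;
end.
```

The non-zero exit code is what lets a build server treat a new finding like a failing test.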

Generating output that passes on the first run

Preflight earns its keep on files arriving from elsewhere. For documents your own code produces, finding violations after generation and patching them back in is the slow way around. PDFlibPas carries a generation-side mode for each standard, and you can switch both on for the same document:

procedure BuildStatement;
var
  Pdf: TPDFlib;
  Diag: WideString;
begin
  Pdf := TPDFlib.Create;
  try
    Pdf.NewDocument;
    Pdf.SetPDFAMode(1);
    Pdf.LoadOutputIntentProfile('sRGB-IEC61966-2.1.icc', 'RGB');
    Pdf.SetPDFUAMode('en-US');
    Pdf.SetInformation(1, 'Quarterly Statement');  // /Title: required for PDF/UA
    // ... draw tagged content here ...
    Diag := Pdf.GetPDFUADiagnostics;
    if Diag <> '' then
      Writeln('fix before shipping: ', Diag);
    Pdf.SaveToFile('statement.pdf');
    // the preflight that counts runs on the saved file, once per standard
    // (assuming the same selectors as CheckFileCompliance):
    Writeln(Pdf.CreatePreflightReport('statement.pdf', '', 1, 0));  // 1 = PDF/A
    Writeln(Pdf.CreatePreflightReport('statement.pdf', '', 2, 0));  // 2 = PDF/UA
  finally
    Pdf.Free;
  end;
end;

The trap hides at save time. Several of the conformance repairs happen while the document is being serialized rather than when you enable the mode: forcing the print flag on annotations, writing the default AFRelationship for PDF/A-3 embedded files, normalizing tab order and form-field descriptions for PDF/UA. The document sitting in memory is not byte-identical to the one that lands on disk, so the only preflight verdict that means anything is the one computed from the saved file. Validate statement.pdf itself. Do not infer compliance from the object still in memory, because the bytes you would be judging are not the bytes you shipped.
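
A post-save check can reuse the ingestion call on the file that was just written. The function below is a sketch, not library code: SavedFileConforms is a hypothetical helper, it assumes the unit declaring TPDFlib is already in the uses clause, and it relies only on the calls and selector values shown earlier in this article.

```pascal
// Returns True only when the saved file is clean for both standards.
// ASSUMPTION: TPDFlib is in scope through the library's own unit.
function SavedFileConforms(Pdf: TPDFlib; const SavedName: string): Boolean;
const
  Selectors: array[0..1] of Integer = (1, 2);  // 1 = PDF/A, 2 = PDF/UA
var
  S, I, ListId: Integer;
begin
  Result := True;
  for S := Low(Selectors) to High(Selectors) do
  begin
    ListId := Pdf.CheckFileCompliance(SavedName, '', Selectors[S], 0);
    if ListId = 0 then
    begin
      // 0 with an error code means the file could not be read at all
      if Pdf.LastErrorCode <> 0 then
        Result := False;
      Continue;
    end;
    Result := False;  // a non-zero handle means findings exist
    try
      for I := 0 to Pdf.GetStringListCount(ListId) - 1 do
        Writeln(SavedName, ' [selector ', Selectors[S], '] ',
          Pdf.GetStringListItem(ListId, I));
    finally
      Pdf.ReleaseStringList(ListId);
    end;
  end;
end;
```

Called right after SaveToFile with the same file name, it judges the serialized bytes and never the in-memory document.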

Invoice scenarios that carry machine-readable XML alongside the visual document follow the ZUGFeRD and Factur-X pattern, which is built on PDF/A-3. Such documents should set the attachment relationship explicitly with SetPDFA3DefaultAFRelationship, since ISO 19005-3 requires every embedded file to declare its role relative to the document through an AFRelationship entry. Leave it unset and the embedded XML is just a blob with no stated purpose, which a PDF/A-3 validator flags.

Independent referees: veraPDF and Acrobat

A producer should not be the only judge of its own output. The PDFlibPas checkers give you fast, structured verdicts in-process, which is what you want on the hot path, but the release gate for an archival batch should still pass the output through a validator nobody on your team wrote. veraPDF is the community-maintained reference implementation for PDF/A and the tool most archives name in their acceptance criteria, so it is the one to match. Acrobat's preflight profiles make a useful tiebreaker when veraPDF and the in-process check disagree. Record the validator name and its version beside every stored report. A claim that a file passed veraPDF says very little without the build number that passed it, since the tool tightens its rules between releases.
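
Recording provenance can be as simple as a one-line sidecar file next to each stored report. The sketch below shows one possible layout; TValidationStamp, the field names, and the .stamp extension are all hypothetical, and the version string is taken from the command line so that it always comes from what the validator itself printed.

```pascal
program ReportStamp;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

uses
  SysUtils, Classes;

type
  TValidationStamp = record
    Validator: string;     // tool name, for example 'veraPDF'
    Version: string;       // exact version string the tool reported
    Profile: string;       // for example 'PDF/A-2b'
    CheckedAt: TDateTime;  // local time; convert to UTC if the archive requires it
    Passed: Boolean;
  end;

function StampToLine(const S: TValidationStamp): string;
const
  Verdict: array[Boolean] of string = ('fail', 'pass');
begin
  Result := Format('validator=%s; version=%s; profile=%s; checked=%s; verdict=%s',
    [S.Validator, S.Version, S.Profile,
     FormatDateTime('yyyymmdd"T"hhnnss', S.CheckedAt), Verdict[S.Passed]]);
end;

// Writes the stamp beside the report as <ReportFile>.stamp
procedure WriteStamp(const ReportFile: string; const S: TValidationStamp);
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add(StampToLine(S));
    Lines.SaveToFile(ReportFile + '.stamp');
  finally
    Lines.Free;
  end;
end;

var
  Stamp: TValidationStamp;
begin
  // Usage: ReportStamp <report file> <validator version>
  if ParamCount < 2 then
    Halt(2);
  Stamp.Validator := 'veraPDF';
  Stamp.Version := ParamStr(2);  // never hard-code this
  Stamp.Profile := 'PDF/A-2b';
  Stamp.CheckedAt := Now;
  Stamp.Passed := True;
  WriteStamp(ParamStr(1), Stamp);
end.
```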

Validators do disagree at the edges of the standards, and when they do the answer is not to pick the tool you like. Shrink the file to a minimal sample that still triggers the disagreement and read it against the standard text. An hour of that usually surfaces one of two things: a genuine tool bug worth filing upstream, or a clause your team has been reading wrong and should write down in the compliance notes so the next person does not relitigate it.

Encrypted input gets a shortcut. Both checkers take a password argument, but a PDF/A file with an encryption dictionary is already non-conforming, because every part of ISO 19005 forbids the Encrypt entry in the trailer, so an encrypted submission that claims PDF/A can be turned away before any deeper analysis runs. Working out what an encryption dictionary actually grants is its own task, covered in PDF encryption and permissions auditing.

PDF/UA findings almost always trace back to how the structure tree was authored in the first place, and the tagging techniques behind that live in building tagged PDF structure trees in Delphi. Archives that also demand digital signatures should pair this gate with the workflow in PAdES signing and validation. The full preflight API reference lives on the losLab PDF Library for Delphi product page.