Technical article

PDF/A Preflight Validation in Delphi with PDFium VCL

An archive ingestion gate rejected a batch of "PDF/A-2b" files that opened fine in every viewer on the desk. The supplier swore they were conformant. They were not: each one carried a JavaScript action buried in the catalog, the kind of thing a casual eyeball never catches and a full PDF/A validator like veraPDF flags in a heartbeat. The catch is that nobody wanted to bolt a Java toolchain onto a Delphi batch service just to answer one yes-or-no question per file. That is the gap ValidatePdfACompliance in PDFium Component fills, and it is worth understanding how it reaches a verdict without ever fully parsing a content stream.

Why PDFium itself cannot answer this

The first thing to be honest about: the bundled pdfium.dll has no PDF/A capability at all. There is no ConvertToPDFA, no OutputIntent writer, no XMP API in the public surface. Every part of PDF/A in this library, both the writing side and the checking side, lives in pure Pascal in FPdfPdfa.pas and works by byte-level parsing plus incremental update. So when you call the validator you are not asking Chromium's renderer anything. You are running a Pascal token scanner over the file's structural bytes.

The public API is deliberately small. One function reads a stream from position 0 and returns a record:

type
  TPdfAValidationResult = record
    Conformance: TPdfAConformance;        // pacUnknown, pacNone, pac1b, pac2u, ...
    Issues: TPdfAValidationIssues;        // a set of TPdfAValidationIssue
    // True only when Conformance is neither pacUnknown nor pacNone
    // AND Issues is empty
    function IsCompliant: Boolean;
  end;

function ValidatePdfACompliance(Source: TStream): TPdfAValidationResult;

IsCompliant encodes the rule that matters in a gate: a file is a pass only when a real conformance level was detected and the issue set is empty. A parse that succeeds but finds no pdfaid marker resolves to pacNone, which is explicitly not a pass. This is the same point the batch preflight report CLI makes from the outside: an empty findings list on an unrecognized file is not a clean bill of health.
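To make the pass rule concrete, here is a minimal, self-contained sketch of it. It does not call the library: the enumerations are cut down to a few members, and MakeResult and GateVerdict are hypothetical helpers invented for the demo. Only the shape of TPdfAValidationResult and the IsCompliant rule come from the description above.

```pascal
program GateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Illustrative subsets; the real enumerations in FPdfPdfa.pas are larger.
  TPdfAConformance = (pacUnknown, pacNone, pac1b, pac2u);
  TPdfAValidationIssue = (pvaiMissingXmpMetadata, pvaiJavaScriptPresent);
  TPdfAValidationIssues = set of TPdfAValidationIssue;

  TPdfAValidationResult = record
    Conformance: TPdfAConformance;
    Issues: TPdfAValidationIssues;
    function IsCompliant: Boolean;
  end;

function TPdfAValidationResult.IsCompliant: Boolean;
begin
  // Pass only when a real level was detected AND nothing was flagged.
  Result := not (Conformance in [pacUnknown, pacNone]) and (Issues = []);
end;

// Hypothetical helper that fakes a validator verdict for the demo.
function MakeResult(Level: TPdfAConformance;
  Found: TPdfAValidationIssues): TPdfAValidationResult;
begin
  Result.Conformance := Level;
  Result.Issues := Found;
end;

// Hypothetical gate decision built on IsCompliant.
function GateVerdict(const R: TPdfAValidationResult): string;
begin
  if R.IsCompliant then
    Result := 'accept'
  else
    Result := 'reject';
end;

begin
  WriteLn('pac2u, no issues  : ', GateVerdict(MakeResult(pac2u, [])));
  WriteLn('pac2u, JavaScript : ', GateVerdict(MakeResult(pac2u, [pvaiJavaScriptPresent])));
  WriteLn('pacNone, no issues: ', GateVerdict(MakeResult(pacNone, [])));
end.
```

The third case is the one a gate most often gets wrong: the issue set is empty, yet the file is still rejected because no conformance level was detected.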

Stripping stream bodies before any token scan

Here is the single most important implementation detail, and the one that is easiest to get wrong if you write your own scanner. The detector finds violations by searching for delimited name tokens, things like /JavaScript, /LZWDecode, /BM. If you scan the raw file bytes, the embedded binary stream bodies (compressed images, ICC profiles, font programs) will sooner or later contain byte sequences that look like those tokens. You will report /AA or /3D "found" because three bytes inside a JPEG happened to spell it. That is a false positive factory.

The fix is PdfStructureBytes: it walks the file and blanks the bytes between every stream and endstream keyword to spaces, leaving the dictionary structure intact. Only after that does the scan run. Every name-token check in the validator operates on this stripped copy. If you take one idea away from this article, take that one. The same discipline is mirrored in the PDF/UA validator, which keeps its own copy of the routine because the two standards evolve independently.
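The idea can be sketched in a few lines. This is not the library's PdfStructureBytes: StripStreamBodies, HasName, and FindFrom are hypothetical names. The sketch also assumes the word "stream" only ever appears as a keyword, where a production scanner would have to be more careful.

```pascal
program StripSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Returns the 1-based index of Sub in S at or after Start (Start >= 1), else 0.
function FindFrom(const Sub, S: AnsiString; Start: Integer): Integer;
begin
  Result := Start;
  while Result <= Length(S) - Length(Sub) + 1 do
  begin
    if Copy(S, Result, Length(Sub)) = Sub then
      Exit;
    Inc(Result);
  end;
  Result := 0;
end;

// Blanks every byte between a "stream" keyword and the next "endstream".
// The length of the data never changes, so offsets stay valid.
function StripStreamBodies(const Data: AnsiString): AnsiString;
var
  P, E, I: Integer;
begin
  Result := Data;
  P := FindFrom('stream', Result, 1);
  while P > 0 do
  begin
    E := FindFrom('endstream', Result, P + 6);
    if E = 0 then
      Break;                          // unterminated stream: leave the rest alone
    for I := P + 6 to E - 1 do
      Result[I] := ' ';
    P := FindFrom('stream', Result, E + 9);
  end;
end;

// True when Name occurs as a whole PDF name, i.e. is followed by
// whitespace, a delimiter, or the end of the data.
function HasName(const S, Name: AnsiString): Boolean;
var
  P, After: Integer;
begin
  Result := False;
  P := FindFrom(Name, S, 1);
  while P > 0 do
  begin
    After := P + Length(Name);
    if (After > Length(S)) or (S[After] in [#0, #9, #10, #12, #13, ' ',
       '(', ')', '<', '>', '[', ']', '{', '}', '/', '%']) then
    begin
      Result := True;
      Exit;
    end;
    P := FindFrom(Name, S, P + 1);
  end;
end;

var
  Pdf: AnsiString;
begin
  // The stream body happens to contain the bytes "/AA ".
  Pdf := '1 0 obj << /Type /Catalog >> endobj'#10 +
         '2 0 obj << /Length 8 >> stream'#10'xx/AA yy'#10'endstream endobj'#10;
  WriteLn('raw scan finds /AA      : ', HasName(Pdf, '/AA'));
  WriteLn('stripped scan finds /AA : ', HasName(StripStreamBodies(Pdf), '/AA'));
  WriteLn('stripped keeps /Catalog : ', HasName(StripStreamBodies(Pdf), '/Catalog'));
end.
```

Blanking to spaces rather than deleting is the design choice worth copying: every byte offset in the copy still matches the original file.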

The 29 issues and what each one means

TPdfAValidationIssue is a documented contract. The ordinals are frozen because DUnitX tests, the demos, and the report layer all depend on them, so new findings are only ever appended to the end. As of v1.63.0 there are 29 members. They fall into a few families:

  • Metadata and identity: pvaiMissingXmpMetadata, pvaiMissingPdfAIdentifier, pvaiMissingTrailerId (ISO 19005-1 6.1.3), pvaiMissingXmpDates.
  • Color and output: pvaiMissingOutputIntent, pvaiMissingIccProfile, and pvaiMixedDeviceColorSpaces when both DeviceRGB and DeviceCMYK appear (6.2.3.3).
  • Hard prohibitions for every part: pvaiEncryptionPresent (an /Encrypt dictionary is forbidden outright), pvaiJavaScriptPresent, pvaiForbiddenAction, pvaiAdditionalActions, pvaiLzwUsed, pvaiXfaPresent, pvaiNeedAppearancesTrue, pvaiForbiddenAnnotation.
  • Fonts: pvaiFontNotEmbedded and the stricter pvaiUnembeddedFont, plus pvaiUnicodeMappingMissing for a Level U claim without /ToUnicode.
  • Tagging: pvaiLevelAStructureMissing when a conformance=A claim has no tagged structure.

The six newest members, which occupy positions 24 through 29 of the enumeration (counting from one), cover the subtle cases reviewers actually trip over: pvaiTrappedTrue (a /Trapped /True in the Info dictionary, a "false friend" since the value must be False or Unknown), pvaiForbiddenActionSubtype (Sound or Movie used as an action, not just an annotation), pvaiTransparentColorSpace (a non-Normal blend mode, or a /CA or /ca value not equal to 1.0), pvaiAnnotationDictViolation, pvaiUnembeddedFont, and pvaiMixedDeviceColorSpaces.

Part-aware gating: A-1 is strict, A-2 and A-3 relax

PDF/A is not one rulebook. Three things that PDF/A-1 forbids are explicitly permitted from PDF/A-2 onward: transparency (a /Transparency group or an active /SMask, 6.4), optional content (/OCProperties, 6.1.13), and embedded files (/EmbeddedFiles or /EF, 6.1.11). A naive validator that flags all three for every file will reject perfectly valid PDF/A-2 documents en masse.

So the validator reads the part number from the pdfaid marker through PdfAPartOf and gates those checks behind PartNo = 1. The blend-mode and annotation-alpha checks for the new transparency issues are similarly part-1 only:

if PartNo = 1 then
begin
  if PdfHasName(Struct, '/BM') then
    if not PdfHasBMNormal(Struct) then          // only /Normal or /Compatible allowed
      Include(Result.Issues, pvaiTransparentColorSpace);
  if PdfHasCaNotOne(Struct, '/CA') or PdfHasCaNotOne(Struct, '/ca') then
    Include(Result.Issues, pvaiTransparentColorSpace);
end;

One conservative default deserves a mention: when there is no pdfaid marker at all, the part is treated as 1, the strictest. The reasoning is that an unidentified file should be held to the tightest rules rather than waved through. JavaScript, forbidden actions, LZW, XFA, NeedAppearances, forbidden annotations, and unembedded fonts stay forbidden for every part, so those checks never sit behind the gate.
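The gating logic, including the strict default, can be modelled roughly as follows. PartFromXmp and ChecksForPart are hypothetical stand-ins rather than the library's PdfAPartOf, and the sketch assumes the part number is the first digit after the text pdfaid:part, in either element or attribute form.

```pascal
program PartGateSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

type
  // The three checks that only apply to PDF/A-1.
  TGatedCheck = (gcTransparency, gcOptionalContent, gcEmbeddedFiles);
  TGatedChecks = set of TGatedCheck;

// Extracts the PDF/A part from XMP text. Falls back to 1, the strictest
// part, when no pdfaid:part marker is present or its value is unreadable.
function PartFromXmp(const Xmp: AnsiString): Integer;
var
  P: Integer;
begin
  Result := 1;                                  // conservative default
  P := Pos('pdfaid:part', Xmp);
  if P = 0 then
    Exit;
  Inc(P, Length('pdfaid:part'));
  // Skip the punctuation of <pdfaid:part>2</...> or pdfaid:part="2".
  while (P <= Length(Xmp)) and (Xmp[P] in ['>', '=', '"', '''', ' ']) do
    Inc(P);
  if (P <= Length(Xmp)) and (Xmp[P] in ['1'..'4']) then
    Result := Ord(Xmp[P]) - Ord('0');
end;

// Returns the part-1-only checks that should run for a given part.
function ChecksForPart(PartNo: Integer): TGatedChecks;
begin
  if PartNo = 1 then
    Result := [gcTransparency, gcOptionalContent, gcEmbeddedFiles]
  else
    Result := [];
end;

begin
  WriteLn('element form  : part ', PartFromXmp('<pdfaid:part>2</pdfaid:part>'));
  WriteLn('attribute form: part ', PartFromXmp('<rdf:Description pdfaid:part="3"/>'));
  WriteLn('no marker     : part ', PartFromXmp('<x:xmpmeta/>'));
  WriteLn('transparency checked for part 1: ',
    gcTransparency in ChecksForPart(PartFromXmp('<x:xmpmeta/>')));
  WriteLn('transparency checked for part 2: ',
    gcTransparency in ChecksForPart(2));
end.
```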

Expanding object streams so nothing hides

PDF 1.5 introduced the cross-reference stream and the object stream (/Type /ObjStm), and they create a blind spot for a naive byte scanner. A catalog, an OutputIntent, an action dictionary, anything that is not itself a stream, can be Flate-compressed inside an ObjStm. Scan the raw structure and you will see none of it, then report a clean file that is anything but.

PdfExpandObjectStreams closes that gap. Before any check runs, the validator does Data := PdfExpandObjectStreams(Data). The routine finds every ObjStm, reads /N and /First from its dictionary, inflates the body with PdfInflate (the RTL zlib, System.ZLib on Delphi and zstream on FPC), reads the N object-number and offset pairs from the first /First bytes of the inflated data, and appends each contained object as an ordinary N 0 obj ... endobj to the end of a copy of the bytes. The existing token checks then find those objects with no change to their logic.
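The unpacking step on its own looks roughly like this. The sketch assumes the body has already been inflated, so no zlib call appears. ExpandInflatedObjStm is a hypothetical name, not the library routine. It relies on the PDF rule that the decoded body starts with N pairs of "object number, offset" and that offsets are relative to /First.

```pascal
program ObjStmSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

// Rewrites the objects held in an ALREADY INFLATED object-stream body as
// ordinary "N 0 obj ... endobj" text. N and First are the /N and /First
// values from the ObjStm dictionary.
function ExpandInflatedObjStm(const Body: AnsiString;
  N, First: Integer): AnsiString;
var
  Nums, Offs: array of Integer;
  I, P, StartPos, EndPos: Integer;
  NumText: ShortString;

  // Reads the next unsigned decimal integer from Body, starting at P.
  function NextInt: Integer;
  begin
    while (P <= Length(Body)) and (Body[P] in [' ', #9, #10, #13]) do
      Inc(P);
    Result := 0;
    while (P <= Length(Body)) and (Body[P] in ['0'..'9']) do
    begin
      Result := Result * 10 + Ord(Body[P]) - Ord('0');
      Inc(P);
    end;
  end;

begin
  Result := '';
  SetLength(Nums, N);
  SetLength(Offs, N);
  P := 1;
  for I := 0 to N - 1 do              // header: N pairs "objnum offset"
  begin
    Nums[I] := NextInt;
    Offs[I] := NextInt;
  end;
  for I := 0 to N - 1 do
  begin
    StartPos := First + Offs[I] + 1;  // offsets are 0-based, strings 1-based
    if I < N - 1 then
      EndPos := First + Offs[I + 1]   // runs up to the next object
    else
      EndPos := Length(Body);
    Str(Nums[I], NumText);
    Result := Result + NumText + ' 0 obj'#10 +
      Copy(Body, StartPos, EndPos - StartPos + 1) + #10'endobj'#10;
  end;
end;

var
  Obj1, Obj2, Header, Body: AnsiString;
  OffText: ShortString;
begin
  Obj1 := '<< /S /JavaScript >> ';
  Obj2 := '<< /Type /Catalog >>';
  Str(Length(Obj1), OffText);          // offset of the second object
  Header := '11 0 12 ' + OffText + ' ';
  Body := Header + Obj1 + Obj2;
  Write(ExpandInflatedObjStm(Body, 2, Length(Header)));
end.
```

Once the objects are appended in this plain form, a name-token scan sees the /JavaScript entry that was previously invisible inside the compressed container.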

Two constraints make this clean rather than fragile. First, stream objects (the Metadata, the ICC profile, the font programs) cannot live in an object stream; only non-stream dictionaries can. Expansion therefore only ever deals with dictionaries, and the appended objects carry no stream keyword to disturb the body-stripping pass. Second, because the appended content lands after %%EOF, the reverse search for startxref still finds the original trailer. The cross-reference stream trailer itself was already handled earlier, in v1.49.3, by reading Root, Size, and ID straight from the plaintext xref-stream dictionary, a topic explored in the companion piece on validating object and cross-reference streams. The object-stream work only had to add the inflate step, with no need to decode type-2 xref entries or unwind a PNG predictor.

The honest limits of a byte-level checker

This is a preflight tool, not a certified validator, and the boundaries are real. Font embedding is a counting heuristic, and getting it right took a correction worth knowing about. The original check used PdfCountName('/FontDescriptor'), but each font contributes two /FontDescriptor tokens, one reference from the font dictionary and one /Type in the descriptor object itself, so the count was 2N against N embedded programs and the test was always true. The fix is PdfCountDescriptorRefs, which counts only the /FontDescriptor N G R reference form, one per font, and raises pvaiUnembeddedFont only when embedded programs are genuinely fewer:

K := PdfCountDescriptorRefs(Struct);                 // one per font dict
Emb := PdfCountName(Struct, '/FontFile')
     + PdfCountName(Struct, '/FontFile2')
     + PdfCountName(Struct, '/FontFile3');
if (K > 0) and (Emb < K) then
  Include(Result.Issues, pvaiUnembeddedFont);
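The 2N-versus-N problem is easy to reproduce on a toy document. The sketch below re-implements both counters under hypothetical names (CountName, CountDescriptorRefs, FindFrom, Skip); these are assumptions about how such counters could work, not the library's code.

```pascal
program FontCountSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF FPC}{$APPTYPE CONSOLE}{$ENDIF}

const
  Key = '/FontDescriptor';

// Returns the 1-based index of Sub in S at or after Start (Start >= 1), else 0.
function FindFrom(const Sub, S: AnsiString; Start: Integer): Integer;
begin
  Result := Start;
  while Result <= Length(S) - Length(Sub) + 1 do
  begin
    if Copy(S, Result, Length(Sub)) = Sub then
      Exit;
    Inc(Result);
  end;
  Result := 0;
end;

// Counts whole-name occurrences, so /FontFile does not match /FontFile2.
function CountName(const S, Name: AnsiString): Integer;
var
  P, After: Integer;
begin
  Result := 0;
  P := FindFrom(Name, S, 1);
  while P > 0 do
  begin
    After := P + Length(Name);
    if (After > Length(S)) or (S[After] in [#0, #9, #10, #12, #13, ' ',
       '(', ')', '<', '>', '[', ']', '{', '}', '/', '%']) then
      Inc(Result);
    P := FindFrom(Name, S, After);
  end;
end;

// Advances P over whitespace (Digits = False) or digits (Digits = True).
function Skip(const S: AnsiString; P: Integer; Digits: Boolean): Integer;
begin
  while (P <= Length(S)) and
        ((Digits and (S[P] in ['0'..'9'])) or
         ((not Digits) and (S[P] in [' ', #9, #10, #13]))) do
    Inc(P);
  Result := P;
end;

// Counts only the reference form "/FontDescriptor N G R".
function CountDescriptorRefs(const S: AnsiString): Integer;
var
  P, A, B, C, D, E: Integer;
begin
  Result := 0;
  P := FindFrom(Key, S, 1);
  while P > 0 do
  begin
    A := Skip(S, P + Length(Key), False);   // whitespace after the key
    B := Skip(S, A, True);                  // object number N
    C := Skip(S, B, False);
    D := Skip(S, C, True);                  // generation G
    E := Skip(S, D, False);
    if (A > P + Length(Key)) and (B > A) and (C > B) and (D > C) and
       (E > D) and (E <= Length(S)) and (S[E] = 'R') then
      Inc(Result);
    P := FindFrom(Key, S, P + 1);
  end;
end;

var
  Doc: AnsiString;
begin
  // One font, fully embedded: one reference, one /Type entry, one program.
  Doc := '4 0 obj << /Type /Font /FontDescriptor 5 0 R >> endobj'#10 +
         '5 0 obj << /Type /FontDescriptor /FontFile2 6 0 R >> endobj'#10;
  WriteLn('raw /FontDescriptor tokens: ', CountName(Doc, Key));
  WriteLn('descriptor references     : ', CountDescriptorRefs(Doc));
  WriteLn('embedded programs         : ', CountName(Doc, '/FontFile') +
    CountName(Doc, '/FontFile2') + CountName(Doc, '/FontFile3'));
end.
```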

Even corrected, it is coarse: it compares document-wide totals rather than pairing each descriptor with its own font program, so an individual non-conformant font can still slip past whenever the totals happen to balance. Expanding object streams has a known side effect too: it exposes the standard-14 default resources an AcroForm /DR carries, such as /Helv, and the heuristic dutifully reports them as not embedded even though veraPDF lets them through because they are never actually used to render. Content-stream operator-level checks (6.2.10) are out of scope entirely, since they would need full content parsing rather than a byte scan. Treat the validator as a fast, dependency-free first gate that catches the violations marker injection cannot fix, and reserve a full validator for final certification.

This is the checking half of the story. The complementary writing side, where SaveAsPdfA injects the XMP, OutputIntent, and sRGB ICC profile and honestly downgrades a Level A request that has no tagged structure, builds on the same byte-level machinery. Both halves ship in the PDFium Component for Delphi, a single VCL package over a pure-Pascal PDF/A implementation with no external runtime to install.