A scanning bureau I consulted for ran a nightly job that stamped thousands of digitized records "archive ready." Six months in, an external auditor sampled the archive with veraPDF and found PDF/A violations in files the job had waved through. The job had not lied, exactly — it had checked the wrong profile, collapsed every outcome into a single pass/fail bit, and discarded the report files that would have exposed the discrepancy immediately. Keep that incident in mind when you wire the preflight engine of PDFium Component, a source-code PDF library for Delphi, C++Builder, and Lazarus, into a command-line tool: the validation calls are the easy part, and the contract around them — profiles, exit codes, report retention — is where batch preflight succeeds or quietly rots.
The contract: what a scheduler can actually see
A CI runner or Windows Task Scheduler sees exactly two things from your tool: the exit code and whatever files it left behind. Everything else — log lines, console colors, progress output — is for humans watching live. So before touching the API, fix the exit-code vocabulary and keep it boring:
0: every file conformed to every requested profile
1: at least one file produced validation findings
2: the tool itself failed on at least one file (corrupt input, lock, crash)
The distinction between codes 1 and 2 is the one teams skip and regret. A corrupt PDF that cannot be opened is not a validation failure, and treating it as one means a truckload of damaged scans shows up in your metrics as a sudden conformance collapse rather than as the operational incident it actually is.
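One way to keep codes 1 and 2 from blurring is to record an outcome per file and let the worst one decide the exit code. The sketch below is an illustration under that assumption; TFileOutcome, Worst, and BatchExitCode are hypothetical names, not part of any library.

```pascal
program ExitCodeFold;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical per-file outcome, ordered from best to worst so that
  // the ordinal value can double as the exit code (0, 1, 2).
  TFileOutcome = (foConformed, foFindings, foToolFailure);

// Keep the worse of two outcomes: a tool failure outranks findings,
// and findings outrank a clean pass.
function Worst(A, B: TFileOutcome): TFileOutcome;
begin
  if Ord(B) > Ord(A) then
    Result := B
  else
    Result := A;
end;

// Fold a batch of per-file outcomes into the single exit code.
function BatchExitCode(const Outcomes: array of TFileOutcome): Integer;
var
  I: Integer;
  Acc: TFileOutcome;
begin
  Acc := foConformed;
  for I := Low(Outcomes) to High(Outcomes) do
    Acc := Worst(Acc, Outcomes[I]);
  Result := Ord(Acc);
end;

const
  // One unreadable input among otherwise mixed results.
  Demo: array[0..2] of TFileOutcome = (foConformed, foFindings, foToolFailure);
begin
  WriteLn(BatchExitCode(Demo)); // prints 2
end.
```

Because a corrupt input is its own outcome rather than a kind of finding, a bad delivery of scans surfaces as exit code 2 instead of inflating the findings count.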
Two more contract items deserve a flag each: a per-file timeout and a quarantine directory. A pathological PDF — thousands of pages, deeply nested object structures — can hold a validation pass for minutes, and a nightly window does not care why. Kill the file's job at the deadline, count it as a tool failure, move the input aside for inspection, and keep the batch moving. The quarantine directory then doubles as a self-collecting corpus of the worst documents your customers actually send, which is worth more to release testing than any synthetic sample.
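The two bookkeeping pieces of that policy can be sketched with SysUtils alone. QuarantineFile and DeadlineExceeded are hypothetical helper names, and the sketch deliberately leaves out the kill itself: stopping a validation pass that is already running generally means running each file in a child process, which is beyond this example.

```pascal
program QuarantineSketch;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Move a problem input into the quarantine directory, keeping its file name.
// Returns False if the directory cannot be created, a file of the same name
// is already quarantined, or the move fails. RenameFile only moves within
// one volume; a cross-volume quarantine would need copy-then-delete.
function QuarantineFile(const SourcePath, QuarantineDir: string): Boolean;
var
  Target: string;
begin
  Result := False;
  if not ForceDirectories(QuarantineDir) then
    Exit;
  Target := IncludeTrailingPathDelimiter(QuarantineDir) +
    ExtractFileName(SourcePath);
  if FileExists(Target) then
    Exit; // never overwrite earlier evidence
  Result := RenameFile(SourcePath, Target);
end;

// True once a file's job has run longer than its per-file timeout.
function DeadlineExceeded(StartedAt, CheckedAt: TDateTime;
  TimeoutSeconds: Integer): Boolean;
begin
  Result := (CheckedAt - StartedAt) * SecsPerDay > TimeoutSeconds;
end;

begin
  // Usage: QuarantineSketch <file> <quarantine-dir>
  if ParamCount = 2 then
    if not QuarantineFile(ParamStr(1), ParamStr(2)) then
      Halt(2);
end.
```

A failed quarantine move is itself treated as a tool failure here, so the scheduler still sees that something operational went wrong.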
Choosing standards: PDF/A-2b is not PDF/A-3a
The TPdfPreflightStandard enumeration selects the standard families that matter in practice: ppsPdfA for ISO 19005 archival conformance, ppsPdfUa for ISO 14289 accessibility, ppsPdfX for print exchange, plus ppsPdfE, ppsPdfR, and ppsPdfVT for engineering, raster, and variable-data workflows. Within a family, the engine detects the conformance level the document claims and reports it per standard in the result's ConformanceName, and the level matters. PDF/A-2b asserts visual reproducibility only; PDF/A-3a additionally demands logical structure tagging, a much harder bar for scanned material, and also permits embedded source files. If your retention policy says PDF/A-2b, failing files for missing structure tags floods the report with findings nobody intends to fix; accepting any PDF/A label without checking the level lets through files that claim a weaker level than the policy requires. Government accessibility mandates increasingly add PDF/UA on top, which costs nothing extra to include because BuildPdfPreflightReport (from the FPdfPreflightReport unit) accepts a set:
Report := BuildPdfPreflightReport(Pdf, [ppsPdfA, ppsPdfUa]);
One call, both standards, one consolidated report record.
Reading the report: silence is not conformance
The report enumerates findings per standard. An empty issue list therefore means "no problems found in the standards that ran" — which is not the same statement as "the file conforms to the standard you care about." If a configuration typo dropped ppsPdfA from the set, the issue list is just as empty. Always walk Report.Results and assert two things per standard you intended: that a result entry for it exists at all, and that its IsCompliant flag (backed by Status = pfsPass) is true. This is precisely the failure mode behind the audit story above: the job equated "no findings" with "archive ready" without ever checking which standards had been evaluated.
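The two assertions can be sketched independently of the library. The types below (TDemoStandard, TDemoResult) are stand-ins invented for this example; they only mirror the Standard and IsCompliant members mentioned above and make no claim about how the real Report.Results collection is declared.

```pascal
program IntendedStandardsCheck;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Stand-in types for illustration only; the library's own types differ.
  TDemoStandard = (dsPdfA, dsPdfUa, dsPdfX);
  TDemoStandards = set of TDemoStandard;
  TDemoResult = record
    Standard: TDemoStandard;
    IsCompliant: Boolean;
  end;

// True only if every intended standard has a result entry AND that entry
// is compliant. A missing entry counts as a failure, not as silence.
function ConformsToAll(const Results: array of TDemoResult;
  Intended: TDemoStandards): Boolean;
var
  I: Integer;
  Seen: TDemoStandards;
begin
  Result := True;
  Seen := [];
  for I := Low(Results) to High(Results) do
    if Results[I].Standard in Intended then
    begin
      Include(Seen, Results[I].Standard);
      if not Results[I].IsCompliant then
        Result := False;
    end;
  // An intended standard with no entry never ran.
  if Seen <> Intended then
    Result := False;
end;

var
  OnlyUa: array[0..0] of TDemoResult;
begin
  // PDF/UA ran and passed, but a typo dropped PDF/A from the set.
  OnlyUa[0].Standard := dsPdfUa;
  OnlyUa[0].IsCompliant := True;
  WriteLn(ConformsToAll(OnlyUa, [dsPdfA, dsPdfUa])); // prints FALSE
end.
```

The check takes the intended set as an explicit argument, so the policy is compared against what actually ran rather than inferred from it.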
The second trap hides in what a finding is: each TPdfPreflightIssue carries a Code, a Category, a Description, and a Recommendation — it names the rule violated, not a page number. That shapes the feedback loop: the report tells the producing team what class of defect to fix (an unembedded font, a missing XMP identifier), and locating the specific offending object is the remediation tool's job, not the validator's. Write your report consumers against the stable Code values rather than parsing description text that may be reworded between releases.
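As a minimal sketch of a consumer keyed on codes, the tally below counts findings per code and never looks at description text. It assumes codes can be rendered as strings; the code values shown are invented for the example and are not the library's actual codes.

```pascal
program TallyByCode;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Count findings per issue code. Only the code is inspected, so rewording
// of descriptions between releases cannot break this consumer.
procedure TallyCode(Tally: TStrings; const Code: string);
begin
  Tally.Values[Code] := IntToStr(StrToIntDef(Tally.Values[Code], 0) + 1);
end;

var
  Tally: TStringList;
begin
  Tally := TStringList.Create;
  try
    // Hypothetical code strings, invented for this example.
    TallyCode(Tally, 'FONT_NOT_EMBEDDED');
    TallyCode(Tally, 'XMP_ID_MISSING');
    TallyCode(Tally, 'FONT_NOT_EMBEDDED');
    WriteLn(Tally.Values['FONT_NOT_EMBEDDED']); // prints 2
  finally
    Tally.Free;
  end;
end.
```

A per-code tally like this is also what the producing team usually wants first: which class of defect dominates the batch.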
Report files for machines and for the person on call
The report record writes the same findings in five formats — SaveJsonToFile, SaveCsvToFile, SaveHtmlToFile, SaveTextToFile, and SaveMarkdownToFile (with matching ToJson-style functions when you want the string instead of a file). Resist picking just one. JSON is for the pipeline: attach it to the job record and let CI parse issue codes and per-standard statuses. HTML is for the operator who gets paged: it opens in any browser with no tooling. Writing both costs one extra line per file and removes the worst on-call experience in batch processing, which is reverse-engineering a JSON blob at two in the morning. Keep the naming deterministic — derive each report name from the input file name, never from a timestamp — or parallel runs will interleave reports you cannot match back to their inputs.
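Deterministic naming fits in one small function. ReportPathFor is a hypothetical helper built only on SysUtils routines; the directory and file names in the demo are placeholders.

```pascal
program ReportNaming;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Derive the report path from the input file name alone: same input,
// same report name, on every run and on every worker.
function ReportPathFor(const ReportDir, InputPath, Extension: string): string;
begin
  Result := IncludeTrailingPathDelimiter(ReportDir) +
    ChangeFileExt(ExtractFileName(InputPath), Extension);
end;

begin
  // The JSON and HTML reports for one input share a base name.
  WriteLn(ReportPathFor('reports', 'scan-0001.pdf', '.json'));
  WriteLn(ReportPathFor('reports', 'scan-0001.pdf', '.html'));
end.
```

One caveat of this scheme: two inputs with the same file name in different directories map to the same report path, so a batch that spans several input folders would need the relative folder folded into the name as well.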
Severity thresholds belong in configuration, not code. The same finding — an annotation without an alternate description, say — is a hard failure for a PDF/UA submission portal and an ignorable note for an internal archive. Expose a fail-on level per profile so the policy can change without a recompile, and record the level that was in force inside the job summary, because next quarter nobody will remember which threshold last October's batch ran under.
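A configurable threshold could look like the sketch below. The severity scale (TDemoSeverity) and the text values 'info', 'warning', and 'error' are assumptions made for the example, not the library's vocabulary; how a real finding maps onto such a scale depends on the profile.

```pascal
program FailOnThreshold;
{$IFDEF FPC}{$mode objfpc}{$H+}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical severity scale, ordered from least to most severe.
  TDemoSeverity = (svInfo, svWarning, svError);

// Parse the fail-on level from a configuration value. Unknown text returns
// False so the caller can stop instead of guessing a policy.
function ParseFailOn(const Text: string; out Level: TDemoSeverity): Boolean;
begin
  Result := True;
  Level := svError;
  if SameText(Text, 'info') then
    Level := svInfo
  else if SameText(Text, 'warning') then
    Level := svWarning
  else if SameText(Text, 'error') then
    Level := svError
  else
    Result := False;
end;

// A finding fails the file when it is at least as severe as the threshold.
function ShouldFail(Found, FailOn: TDemoSeverity): Boolean;
begin
  Result := Ord(Found) >= Ord(FailOn);
end;

var
  FailOn: TDemoSeverity;
begin
  // The level would come from a config file or a flag; hard-coded here.
  if not ParseFailOn('warning', FailOn) then
    Halt(2); // a bad configuration is a tool failure, not a verdict
  WriteLn('fail-on=warning'); // the level in force, for the job summary
  WriteLn(ShouldFail(svInfo, FailOn));    // prints FALSE
  WriteLn(ShouldFail(svWarning, FailOn)); // prints TRUE
end.
```

Rejecting an unrecognized level outright, rather than falling back to a default, keeps a typo in the configuration from silently loosening the policy.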
Isolating files so one bad PDF cannot sink the batch
procedure RunPreflightBatch(const InputDir, ReportDir: string;
  out FilesWithFindings, ToolFailures: Integer);
var
  SR: TSearchRec;
  Pdf: TPdf;
  Report: TPdfPreflightReport;
  InDir, OutDir: string;
begin
  FilesWithFindings := 0;
  ToolFailures := 0;
  // Accept directories with or without a trailing path delimiter.
  InDir := IncludeTrailingPathDelimiter(InputDir);
  OutDir := IncludeTrailingPathDelimiter(ReportDir);
  if FindFirst(InDir + '*.pdf', faAnyFile, SR) = 0 then
  try
    repeat
      Pdf := TPdf.Create(nil); // fresh instance per file: no state bleed
      try
        try
          Pdf.FileName := InDir + SR.Name;
          Pdf.Active := True;
          if not Pdf.Active then // load failures are silent, not raised
            raise EPdfError.Create('Cannot open ' + SR.Name);
          Report := BuildPdfPreflightReport(Pdf, [ppsPdfA, ppsPdfUa]);
          // Write the evidence first, then tally the verdict.
          Report.SaveJsonToFile(OutDir + ChangeFileExt(SR.Name, '.json'));
          Report.SaveHtmlToFile(OutDir + ChangeFileExt(SR.Name, '.html'));
          // An issue count alone cannot tell whether every intended standard
          // ran; a production tally should also walk Report.Results.
          if Report.TotalIssueCount > 0 then
            Inc(FilesWithFindings);
        except
          on E: Exception do
          begin
            Inc(ToolFailures); // exit-code-2 territory, not a validation verdict
            WriteLn(ErrOutput, SR.Name + ': ' + E.Message);
          end;
        end;
      finally
        Pdf.Free;
      end;
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end;
Four deliberate choices live in that loop. A fresh TPdf per file guarantees that a document which corrupts engine state cannot poison its successors. The explicit Active check matters because setting Active := True swallows load errors instead of raising; without the guard, a truncated file would drift into the validation call before failing, with a misleading message. The inner try..except sits inside the per-file scope, so an exception increments the failure counter and the loop continues: the auditor wants reports for the 4,999 good files even when file 5,000 is truncated. And both report formats are written before the verdict is tallied, so the evidence survives even if the summary logic has a bug.
The exit-code mapping then collapses to a few lines in the project file:
var
  Findings, Failures: Integer;
begin
  RunPreflightBatch(ParamStr(1), ParamStr(2), Findings, Failures);
  if Failures > 0 then
    Halt(2)
  else if Findings > 0 then
    Halt(1);
  // falling through exits with 0: every file conformed
end.
What preflight will not do for you
The engine detects; it does not repair. A finding about an unembedded font or a device-dependent color space is a work order for whoever produces the files, not something the validator can patch in place. Plan the feedback loop accordingly: reports need to land where the producing team reads them, or the same findings will reappear every night until someone wonders why the curve never bends. It also pays to cross-check a sample of verdicts against an independent validator — veraPDF for PDF/A, Acrobat's preflight for PDF/X — before an external auditor does it for you. Two engines agreeing is strong evidence; one engine alone is an opinion.
Frequently asked questions
Can I validate PDF/A and PDF/UA in the same pass?
Yes. BuildPdfPreflightReport takes a set of standards, so [ppsPdfA, ppsPdfUa] evaluates both in one run with one consolidated report. PDF/UA checks also pair naturally with viewer-side accessibility work such as the patterns in building an accessible PDF reader in Delphi.
Why does my report show no issues for a file veraPDF rejects?
First confirm the standard actually ran: walk Report.Results for an entry whose Standard matches and check its IsCompliant flag, rather than inferring from an empty findings list. If the standards match and the engines still disagree, keep the document as a named regression case — divergent verdicts on real customer files are exactly what release testing needs.
Does preflight fix the problems it finds?
No. It reports color, font, structure, and metadata violations with codes, categories, and recommendations; remediation happens in the producing workflow. Budget for that loop, not just for the check.
Profiles, report formats, and the full preflight API are documented on the product page: PDFium Component. The same engine drives interactive checks in a review UI, so this CLI and your PDF intake review workbench can share one validation vocabulary.