Technical Article

Building a workbook audit and conversion workbench in Delphi with HotXLS

A document migration hands you forty thousand spreadsheet files accumulated since 2003: BIFF-era .xls, modern .xlsx, a scattering of .ods from a LibreOffice pilot, a few files nobody can open because the password left with an employee. The mandate is to normalize everything to XLSX and CSV. The naive plan — loop, open, save-as — will produce forty thousand output files and no record of which ones lost their charts, dropped their macros, or never opened at all. What the job actually needs is a workbench: inventory first, convert second, verify third.

HotXLS gives a Delphi or C++Builder process everything required for that, with no Excel installation anywhere in the loop: two native engines (a BIFF8 facade and an OOXML facade), cheap probing calls that avoid full parses, per-sheet audit counters, and a documented conversion matrix. The catch is that each of those has edges worth knowing before the batch runs overnight.

Probe before you load: sheet names and encryption detection

Opening a 200 MB workbook just to discover it is encrypted wastes minutes per file; multiplied across an archive it wastes days. Both facades expose GetSheetNames, which reads sheet metadata without populating the workbook — the BIFF implementation scans only the BoundSheet records at the front of the stream, and the OOXML implementation reads only workbook.xml inside the zip. CanReadEncrypted detects an encryption container without attempting decryption:

var
  Probe: TXLSXWorkbook;
  Names: TStringList;
begin
  Names := TStringList.Create;
  Probe := TXLSXWorkbook.Create;
  try
    if Probe.CanReadEncrypted(FileName) then
    begin
      Writeln(FileName + ': encrypted container - route to manual handling');
      Exit;
    end;
    if Probe.GetSheetNames(FileName, Names) <= 0 then
      Writeln(FileName + ': unreadable - quarantine')
    else
      Writeln(Format('%s: %d sheet(s), first "%s"',
        [FileName, Names.Count, Names[0]]));
  finally
    Probe.Free;
    Names.Free;
  end;
end;

Two operational notes. GetSheetNames does not reset or populate the workbook instance, so one probe object can classify thousands of files in a tight loop. And the XLS-facade version of the same call also understands .xlsx packages, which makes it a convenient single probe when extensions are untrustworthy — as they always are in old archives. For a deeper treatment of triage-before-load, see our article on sheet listing and lightweight workbook inspection.

Counting what a workbook really contains

Once a file passes triage, the audit pass decides its conversion route. The XLSX facade exposes a counter for every feature family that matters to fidelity decisions — merged cells, charts, images, conditional formats, data validations, tables, hyperlinks, comments — plus workbook-level flags for macros, protection, and source format:

var
  Book: TXLSXWorkbook;
  Sheet: TXLSXWorksheet;
  I: Integer;
begin
  Book := TXLSXWorkbook.Create;
  try
    if Book.Open(FileName) <> 1 then Exit;
    for I := 0 to Book.Sheets.Count - 1 do
    begin
      Sheet := Book.Sheets[I];
      Writeln(Format('%s: cells=%d merges=%d charts=%d cf=%d dv=%d protected=%s',
        [Sheet.Name, Sheet.Cells.Count, Sheet.MergedCells.Count,
         Sheet.Charts.Count, Sheet.ConditionalFormats.Count,
         Sheet.DataValidations.Count, BoolToStr(Sheet.IsProtected, True)]));
    end;
    if Book.HasVbaProject then
      Writeln('  contains VBA project - macro policy applies');
    if Book.ExternalLinks.Count > 0 then
      Writeln(Format('  %d external link(s)', [Book.ExternalLinks.Count]));
  finally
    Book.Free;
  end;
end;

Interpret Cells.Count carefully: the cell store is sparse, so the number reflects instantiated cells, not the rectangular area of the used range. A sheet with two values in A1 and ZZ9999 counts two cells. On the BIFF side the equivalent scan uses UsedRange bounds and ForEachCell — and carries the classic HotXLS off-by-one: UsedRange.FirstRow and friends are 0-based while Cells.Item[Row, Col] is 1-based, so a traversal must add one to each bound or it audits the wrong rectangle.

Two levers speed up audit-only passes over large legacy files. Setting _DisableGraphics to true before opening an .xls skips the OfficeArt drawing-layer parse entirely, which is a substantial saving on workbooks dense with shapes — but it is strictly a read-only optimization, because saving from such an instance would drop the unparsed drawings. And when the audit needs per-cell content rather than counters, the ForEachCell callback walks the populated cells directly, avoiding the per-access Variant overhead of indexed cell properties across millions of reads.

Normalize the inconsistent return codes early

HotXLS I/O calls report errors through integer results rather than exceptions, and the conventions are not uniform across the API surface. Most open and save calls return 1 on success and -1 on failure; GetSheetNames returns the count or -1 and clears the list; XLSX SaveAsHTML returns 0 for success and -1 for a sheet index out of range. A workbench that tests = 1 everywhere will misclassify some successes, and one that tests <> -1 will miss others.

The pragmatic rule that holds across the whole surface: treat <= 0 as failure for count-returning calls, check the documented success value for each save routine you actually use, and wrap both in one small result-checking function so the convention lives in exactly one place. Batch pipelines die by the thousand-file pile-up of unchecked return codes, not by exotic parser bugs.

The conversion matrix — and where each road is lossy

The two facades divide the conversion work. TXLSXWorkbook opens XLSX, ODS, and CSV, and saves XLSX, ODS, CSV, HTML, RTF, and AES-encrypted XLSX. TXLSWorkbook opens and saves BIFF and exports HTML, RTF, and CSV. Every path has a documented fidelity profile rather than a vague promise:

CSV export writes UTF-8 with BOM, CRLF line ends, and RFC 4180 quoting — but formulas are emitted as literal text, never evaluated, so a sheet of =SUM(...) cells exports as formula strings unless you compute values first. HTML export produces a single table with colspan and rowspan for merges and inline base styles. RTF export cannot span merged cells across columns; continuation cells come out empty. ODS import is deliberately lightweight per the library's own documentation: scalar values and cached formula results arrive, while styles, live ODF formula expressions, and drawings do not — meaningful when the archive contains OpenDocument files governed by OASIS ODF 1.3, where a visually faithful conversion requires more than the import path provides.

SaveXLSWorkbookAsXLSX is a data bridge, not a layout bridge

The BIFF facade cannot write OOXML directly; the crossing is the SaveXLSWorkbookAsXLSX function from the lxXlsxExport unit. It is important to size its fidelity honestly: the bridge copies values, formulas, number formats, fill colors, core font attributes, column widths, and view settings such as gridlines — and it does not copy borders, merged ranges, comments, charts, or conditional formats. For data-grade normalization of an archive that downstream systems will parse, that is exactly enough. For presentation-grade conversion of a formatted board report, it is not, and the honest workbench routes such files to a manual queue instead of shipping a degraded copy.

var
  Legacy: IXLSWorkbook;        // interface reference: do not Free
  Modern: TXLSXWorkbook;
begin
  if SameText(ExtractFileExt(FileName), '.xls') then
  begin
    Legacy := TXLSWorkbook.Create;
    if Legacy.Open(FileName) <= 0 then Exit;
    if SaveXLSWorkbookAsXLSX(Legacy,
         ChangeFileExt(FileName, '.xlsx')) <= 0 then
      Writeln('bridge failed: ' + FileName);
  end
  else
  begin
    Modern := TXLSXWorkbook.Create;
    try
      Modern.StreamingWrite := True;     // stream sheet XML into the zip
      if Modern.Open(FileName) = 1 then
        Modern.SaveAsCSV(ChangeFileExt(FileName, '.csv'), 0, ',');
    finally
      Modern.Free;
    end;
  end;
end;

The loop above also shows the throughput lever for the OOXML side: StreamingWrite streams worksheet XML directly into the output package instead of staging it as one giant string, which matters once files reach hundreds of thousands of rows. Sizing and memory behavior for that mode are covered in our article on streaming writes for server batch jobs. Neither facade is thread-safe, but neither shares global state either — parallel conversion with one workbook instance per worker thread is the supported pattern.

Frequently asked questions

Can the workbench convert password-protected files?

Split the answer by format. Legacy .xls encryption (RC4, RC4 CryptoAPI, XOR) is readable: pass the password to Open and convert normally. Encrypted .xlsx packages are detectable with CanReadEncrypted but not decryptable by HotXLS — route them to a queue where someone opens and re-saves them in Excel first.

Do formulas get recalculated during conversion?

No save path evaluates formulas. Excel recalculates on open, so XLSX-to-XLSX conversion is safe; CSV targets receive formula text unless your pipeline evaluates the cells with Calculate and writes results back before exporting.

How should the workbench verify its own output?

Reopen every converted file with the same library and re-run the audit counters, then spot-check a sample in Excel or LibreOffice. Comparing pre- and post-conversion sheet counts and cell counts catches the overwhelming majority of silent conversion losses at negligible cost.

An audit-first workbench turns a risky bulk conversion into a measurable process with a quarantine lane. All of the probing, counting, and conversion calls shown here are part of the HotXLS Component, which runs them natively in-process without Excel automation.