Technical Article

Stream Huge XLSX Files in Delphi Without Loading Them

A spreadsheet with a million rows and a dozen columns is a perfectly ordinary export from a database reporting job. Open it the usual way, by loading the whole workbook into a TXLSWorkbook, and the process has to materialise every one of those twelve million cells as a live object before your first line of business logic runs. The file on disk might be sixty megabytes of compressed XML. The object tree it expands into is several times that, and it all has to be resident at once because the model is random-access by design. For a report you intend to read top to bottom and throw away, that is a great deal of memory spent on a structure you never needed

There is a second path through the same file. Instead of building a model, you scan the worksheet XML forward only, one cell at a time, and let each cell flow past after you have looked at it. Nothing accumulates. Memory stays near constant whether the sheet has a thousand rows or ten million, because the reader is never holding more than the part it is currently parsing plus a couple of small lookup tables. This is what the HotXLS direct reader does, and the rest of this article is about why it stays small and what it gives you in exchange

Why the in-memory model does not scale

An XLSX file is a ZIP package of XML parts described by ECMA-376. Each worksheet is its own part, xl/worksheets/sheetN.xml, and inside it every row is a <row> element holding <c> cell elements. The regular load path reads that part and constructs an addressable object for each cell so that you can later ask for Cells[12345, 7] and get an answer in constant time. Random access is the whole point of a workbook model, and it is exactly what makes editing, formula evaluation, and styling convenient

The cost is that random access requires everything to be present simultaneously. You cannot index into a structure you have only partially built. So the peak memory of a full load is a function of cell count, and on a sheet with millions of populated cells that function lands somewhere your service does not want to be, especially if several such jobs run at once on a shared machine. When the access pattern you actually need is sequential, paying for random access is paying for a capability you will not use

A forward-only SAX scan that builds no tree

The direct reader opens the ZIP package and walks each worksheet part with a SAX-style pull parser. SAX here means the parser reports parse events as it encounters them, a start element, a text run, an end element, and then moves on. It keeps no node tree behind it. The reader tracks the current row and column from the r attributes, gathers the cell's type, style index, value, and formula text as the events arrive, and when the closing </c> tag is seen it emits one cell and forgets it. The next cell reuses the same handful of local variables

Because nothing is retained between cells, the memory footprint does not grow with the number of cells. That is the property worth holding onto. A two-hundred-row sheet and a twenty-million-row sheet cost the reader the same resident memory, and the difference between them is only how long the scan runs. You give up random access, the model's headline feature, and in return you get a ceiling on memory that the cell count cannot push through

What stays resident, and why those two parts

The scan is not entirely stateless, and the exceptions are instructive. Two small tables have to be held in memory for the duration, because a cell on its own does not carry enough information to interpret without them

The first is the shared string table. In SpreadsheetML, a text cell does not store its own text. It carries t="s" and a numeric payload that is an index into xl/sharedStrings.xml, a single deduplicated list of every distinct string in the workbook. This is a good space trade for files where the same labels repeat across thousands of rows, but it means the reader has to load that string table up front and keep it resident, because any cell anywhere in any sheet may reference any entry in it. The table is sized by the number of distinct strings, not by cell count, so it stays modest even on enormous sheets

The second is the number-format mapping from the styles part. A numeric cell and a date cell are byte-for-byte the same on the wire: both are a plain number, because a date in SpreadsheetML is just a serial day count. The only thing that distinguishes them is the cell's style, which points through cellXfs in xl/styles.xml to a number-format id. To report a date as a date rather than as the raw serial number, the reader loads that style-to-format table and keeps it resident. Everything else in the file, the actual cell data that makes up the bulk of the bytes, streams past without being stored

Every cell reports a kind and a value

Each emitted cell arrives as a TXLSDirectCell record. It carries the sheet index and name, the 1-based row and column, a semantic Kind, the Value as a Variant, the Formula text without its leading equals sign, and the raw StyleIndex. The kind is one of xdkNumber, xdkString, xdkBoolean, xdkDate, or xdkError, so you can branch on what the cell means rather than re-deriving it from attributes. A formula cell reports the kind of its cached result, with the formula text alongside, so a computed total comes through as a number that also tells you how it was produced

type
  TReportScan = class
    procedure OnCell(Sender: TObject; const Cell: TXLSDirectCell;
      var Abort: Boolean);
  end;

procedure TReportScan.OnCell(Sender: TObject; const Cell: TXLSDirectCell;
  var Abort: Boolean);
begin
  case Cell.Kind of
    xdkString:  AccumulateLabel(Cell.Row, Cell.Col, VarToStr(Cell.Value));
    xdkNumber:  AddToTotals(Cell.Col, Double(Cell.Value));
    xdkDate:    NoteWhen(Cell.Row, VarToDateTime(Cell.Value));
    xdkBoolean: FlagRow(Cell.Row, Boolean(Cell.Value));
    xdkError:   LogBadCell(Cell.Row, Cell.Col, VarToStr(Cell.Value));
  end;
end;

Telling a date from a number

The date question deserves a closer look because it is where most naive scanners go wrong. There is no date type on a numeric cell. A cell holding the serial value 46000 could be a quantity, a price, or the 17th of February 2025, and the file tells you which only through the number-format id reached via the cell's style. ECMA-376 reserves a block of built-in format ids whose meaning is fixed across every conforming producer, and the date-bearing ids sit in two ranges: 14 through 22 for the standard date and time formats, and 45 through 47 for the elapsed-time formats such as [h]:mm:ss. When DetectDates is on, which it is by default, the reader resolves each numeric cell's style to its format id, and a cell whose id falls in those reserved ranges is reported as xdkDate with its Value already converted to a Delphi TDateTime. Custom formats are checked too, by inspecting the format code for date and time tokens, but the reserved ranges are the dependable backbone. Turn DetectDates off and the styles table is not even loaded, every numeric cell comes through as xdkNumber, and the scan is fractionally leaner

Skip sheets and abort early

Sequential scanning has a quiet advantage that random access cannot match: you can stop. The OnSheet event fires before each worksheet is opened, and it gives you two switches. Set SkipSheet and that whole part is never parsed, which is how you scan only the sheets you care about in a multi-sheet workbook without paying to read the rest. Set Abort and the entire scan ends immediately. The OnCell event carries its own Abort, so you can halt the moment you have found what you were looking for, a particular row, a sentinel value, the end of a header block, without reading the remaining millions of cells. On a forward-only scan, abort is genuinely free, because the work you skip is work that had not happened yet

procedure TReportScan.OnSheet(Sender: TObject; SheetIndex: Integer;
  const SheetName: WideString; var SkipSheet: Boolean; var Abort: Boolean);
begin
  // Scan only the "Data" sheet; leave the rest unread
  SkipSheet := SheetName <> 'Data';
end;

Counting cells without a handler

One recent refinement is worth calling out because it turns a common question into a single cheap call. The reader counts every populated cell it passes, and it does this whether or not an OnCell handler is attached. Earlier, with no handler set, the populated-cell count came back as zero, since counting was a side effect of emitting. Now the count is independent of emission. That means you can ask one question, how many populated cells does this workbook actually contain, and get the answer for the price of a scan with no callbacks at all. ReadFile and ReadStream both return that total as an Int64, and the same number is available afterward as the CellCount property. A return of -1 signals that the file could not be opened or is not an OOXML package

var
  Reader: TXLSDirectReader;
  Populated: Int64;
begin
  Reader := TXLSDirectReader.Create;
  try
    // No OnCell handler: a pure populated-cell census, still near-constant memory
    Populated := Reader.ReadFile('quarterly_export.xlsx');
    if Populated < 0 then
      raise Exception.Create('Not a readable XLSX package')
    else
      Writeln(Format('%d populated cells (CellCount = %d)',
        [Populated, Reader.CellCount]));
  finally
    Reader.Free;
  end;
end;

For the full scan, you attach the handler and call ReadFile exactly the same way. The contrast with a full load is the whole point: where loading quarterly_export.xlsx into a workbook would expand every cell into a resident object and hold the lot, the direct reader keeps only the shared strings and the style table while the twelve million cells flow through your OnCell one at a time. The arithmetic that ran per cell leaves nothing behind, so peak memory is set by the workbook's distinct-string count, not by its row count

The direct reader is the right tool when the job is to read a large workbook once and extract or summarise it. When you instead need the random access of the full model but want it to behave on big files, the tuning in our notes on large-workbook performance in Delphi covers that path. And when the direction is reversed, producing large output rather than consuming it, the streaming-write walkthrough for server batch jobs applies the same constant-memory discipline to writing. All three ship as part of the HotXLS Component for Delphi and C++Builder, alongside the reading, writing, formula, and formatting APIs covered elsewhere on this blog