Technical Article

Extracting Document Summary Information from Excel Files in Delphi

When processing large batches of Excel spreadsheets in an automated pipeline, you rarely want to load the entire document into memory just to figure out what it is. Often, the metadata embedded within the file—author, title, creation date, and custom properties—is enough to route, index, or reject the document. In the Microsoft Office world, this metadata is known as Document Summary Information.

Extracting this information natively in Delphi without relying on OLE automation (which requires Excel to be installed on the host machine) requires parsing the underlying file structure directly. In this article, we’ll look at how document summaries work in Excel files and how to extract them efficiently using raw stream parsing.

Understanding Excel Metadata Streams

Historically, older Excel files (.xls) are stored in OLE Compound Document formats, effectively acting as mini file systems containing streams and storages. The metadata is housed in two specific streams:

  • SummaryInformation: Contains standard properties like Title, Subject, Author, Keywords, and Revision Number.
  • DocumentSummaryInformation: Contains extended properties such as Company, Manager, and custom user-defined properties.

Modern Excel files (.xlsx) use the Office Open XML (OOXML) format, which is a zipped XML structure. The metadata here is located in docProps/core.xml, docProps/app.xml, and docProps/custom.xml. A robust Delphi parsing component must seamlessly handle both internal structures while exposing a unified API to the developer.

Parsing OLE Compound Documents in Delphi

To read the SummaryInformation from a legacy `.xls` file without third-party tools, you need to parse the OLE Structured Storage. Microsoft exposes this through the COM interface IPropertySetStorage. Here is a raw Delphi implementation that avoids firing up Excel:

uses
  System.SysUtils, System.Win.ComObj, Winapi.ActiveX, Winapi.Windows;

procedure ExtractXlsSummaryInfo(const FileName: string);
var
  Stg: IStorage;
  PropSetStg: IPropertySetStorage;
  PropStg: IPropertyStorage;
  PropSpec: TPropSpec;
  PropVariant: TPropVariant;
  Hr: HRESULT;
begin
  // Open the OLE Compound Document
  Hr := StgOpenStorage(PWideChar(WideString(FileName)), nil,
    STGM_READ or STGM_SHARE_DENY_WRITE, nil, 0, Stg);
    
  if Failed(Hr) then
    raise Exception.Create('Failed to open OLE storage. File may not be a valid .xls document.');

  // Query for the property set storage interface
  if Stg.QueryInterface(IPropertySetStorage, PropSetStg) = S_OK then
  begin
    // Open the SummaryInformation stream (FMTID_SummaryInformation)
    Hr := PropSetStg.Open(FMTID_SummaryInformation, STGM_READ or STGM_SHARE_EXCLUSIVE, PropStg);
    if Succeeded(Hr) then
    begin
      // Read the Author property (PIDSI_AUTHOR = 4)
      PropSpec.ulKind := PRSPEC_PROPID;
      PropSpec.propid := PIDSI_AUTHOR;
      
      if PropStg.ReadMultiple(1, @PropSpec, @PropVariant) = S_OK then
      begin
        if PropVariant.vt = VT_LPSTR then
          Writeln('Author: ', string(AnsiString(PropVariant.pszVal)));
        PropVariantClear(PropVariant);
      end;
    end;
  end;
end;

Programmatic Extraction with HotXLS

While the Windows COM API works for `.xls` files, it does not work for modern `.xlsx` files (which are ZIP archives). Furthermore, using COM API cross-platform (e.g., on Linux or macOS via FireMonkey) is impossible. Recent updates to the HotXLS component introduced dedicated units (e.g., lxXlsSummary) to isolate and optimize the reading of these summary streams across both formats completely natively in Delphi code.

A Cross-Platform Example

Using the XlsReadDocumentSummaryInformation and XlsReadSummaryInformation interfaces, you can quickly grab the metadata strings from both `.xls` and `.xlsx` without worrying about the underlying file system architecture.

uses
  lxXlsSummary;

var
  Summary: TXlsSummaryInfo;
  ExtendedInfo: TXlsDocumentSummaryInfo;
begin
  // Extract standard summary from an OOXML format seamlessly
  Summary := XlsReadSummaryInformation('C:\Data\FinancialReport.xlsx');
  try
    Writeln('Title: ', Summary.Title);
    Writeln('Author: ', Summary.Author);
    Writeln('Creation Date: ', DateTimeToStr(Summary.CreateTime));
  finally
    Summary.Free;
  end;

  // Extract extended document summary
  ExtendedInfo := XlsReadDocumentSummaryInformation('C:\Data\FinancialReport.xlsx');
  try
    Writeln('Company: ', ExtendedInfo.Company);
    Writeln('Manager: ', ExtendedInfo.Manager);
  finally
    ExtendedInfo.Free;
  end;
end;

Why Dedicated Summary Extraction Matters

The primary benefit of this approach is performance and memory safety. By avoiding the instantiation of the full workbook DOM (Document Object Model) and parsing only the docProps/core.xml or the OLE property streams, your application footprint remains incredibly small. If you are indexing 10,000 Excel files across a network share, attempting to fully parse each one will thrash your memory and take hours. Dedicated summary extraction completes the same task in seconds.

Furthermore, reading the streams natively ensures your application can run as a background service or on a headless Linux server without ever invoking Excel.exe—a critical requirement for modern scalable architectures.

Note: Comprehensive Excel parsing and metadata extraction tools are available in the HotXLS VCL Component.