When processing large batches of Excel spreadsheets in an automated pipeline, you rarely want to load the entire document into memory just to figure out what it is. Often, the metadata embedded within the file, author, title, creation date, and custom properties, is enough to route, index, or reject the document. In the Microsoft Office world, this metadata is known as Document Summary Information.
Extracting this information natively in Delphi without relying on OLE automation (which requires Excel to be installed on the host machine) requires parsing the underlying file structure directly. In this article, we’ll look at how document summaries work in Excel files and how to extract them efficiently using raw stream parsing.
Understanding Excel Metadata Streams
Historically, older Excel files (.xls) are stored in OLE Compound Document formats, effectively acting as mini file systems containing streams and storages. The metadata is housed in two specific streams:
SummaryInformation: Contains standard properties like Title, Subject, Author, Keywords, and Revision Number.DocumentSummaryInformation: Contains extended properties such as Company, Manager, and custom user-defined properties.
Modern Excel files (.xlsx) use the Office Open XML (OOXML) format, which is a zipped XML structure. The metadata here is located in docProps/core.xml, docProps/app.xml, and docProps/custom.xml. A robust Delphi parsing component must seamlessly handle both internal structures while exposing a unified API to the developer.
Parsing OLE Compound Documents in Delphi
To read the SummaryInformation from a legacy `.xls` file without third-party tools, you need to parse the OLE Structured Storage. Microsoft exposes this through the COM interface IPropertySetStorage. Here is a raw Delphi implementation that avoids firing up Excel:
uses
System.SysUtils, System.Win.ComObj, Winapi.ActiveX, Winapi.Windows;
procedure ExtractXlsSummaryInfo(const FileName: string);
var
Stg: IStorage;
PropSetStg: IPropertySetStorage;
PropStg: IPropertyStorage;
PropSpec: TPropSpec;
PropVariant: TPropVariant;
Hr: HRESULT;
begin
// Open the OLE Compound Document
Hr := StgOpenStorage(PWideChar(WideString(FileName)), nil,
STGM_READ or STGM_SHARE_DENY_WRITE, nil, 0, Stg);
if Failed(Hr) then
raise Exception.Create('Failed to open OLE storage. File may not be a valid .xls document.');
// Query for the property set storage interface
if Stg.QueryInterface(IPropertySetStorage, PropSetStg) = S_OK then
begin
// Open the SummaryInformation stream (FMTID_SummaryInformation)
Hr := PropSetStg.Open(FMTID_SummaryInformation, STGM_READ or STGM_SHARE_EXCLUSIVE, PropStg);
if Succeeded(Hr) then
begin
// Read the Author property (PIDSI_AUTHOR = 4)
PropSpec.ulKind := PRSPEC_PROPID;
PropSpec.propid := PIDSI_AUTHOR;
if PropStg.ReadMultiple(1, @PropSpec, @PropVariant) = S_OK then
begin
if PropVariant.vt = VT_LPSTR then
Writeln('Author: ', string(AnsiString(PropVariant.pszVal)));
PropVariantClear(PropVariant);
end;
end;
end;
end;
Programmatic Extraction with HotXLS
While the Windows COM API works for `.xls` files, it does not work for modern `.xlsx` files (which are ZIP archives). Furthermore, using COM API cross-platform (e.g., on Linux or macOS via FireMonkey) is impossible. Recent updates to the HotXLS component introduced dedicated units (e.g., lxXlsSummary) to isolate and optimise the reading of these summary streams across both formats completely natively in Delphi code.
A Cross-Platform Example
Using the XlsReadDocumentSummaryInformation and XlsReadSummaryInformation interfaces, you can quickly grab the metadata strings from both `.xls` and `.xlsx` without worrying about the underlying file system architecture.
uses
lxXlsSummary;
var
Summary: TXlsSummaryInfo;
ExtendedInfo: TXlsDocumentSummaryInfo;
begin
// Extract standard summary from an OOXML format seamlessly
Summary := XlsReadSummaryInformation('C:\Data\FinancialReport.xlsx');
try
Writeln('Title: ', Summary.Title);
Writeln('Author: ', Summary.Author);
Writeln('Creation Date: ', DateTimeToStr(Summary.CreateTime));
finally
Summary.Free;
end;
// Extract extended document summary
ExtendedInfo := XlsReadDocumentSummaryInformation('C:\Data\FinancialReport.xlsx');
try
Writeln('Company: ', ExtendedInfo.Company);
Writeln('Manager: ', ExtendedInfo.Manager);
finally
ExtendedInfo.Free;
end;
end;
Why Dedicated Summary Extraction Matters
The primary benefit of this approach is performance and memory safety. By avoiding the instantiation of the full workbook DOM (Document Object Model) and parsing only the docProps/core.xml or the OLE property streams, your application footprint remains incredibly small. If you are indexing 10,000 Excel files across a network share, attempting to fully parse each one will thrash your memory and take hours. Dedicated summary extraction completes the same task in seconds.
Furthermore, reading the streams natively ensures your application can run as a background service or on a headless Linux server without ever invoking Excel.exe, a critical requirement for modern scalable architectures.
Note: Comprehensive Excel parsing and metadata extraction tools are available in the HotXLS VCL Component.