Technical Article

Optimizing I/O Performance for Gigabyte-Scale PDF Processing

Processing standard PDFs (1MB to 10MB) in Delphi is straightforward using standard stream classes like TFileStream or TMemoryStream. However, when you are tasked with processing gigabyte-scale PDFs—such as massive engineering CAD schematics, high-resolution geospatial maps, or accumulated legal archives—standard memory allocation techniques quickly break down.

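If you load a 2GB PDF into a TMemoryStream in a 32-bit Delphi application, you will hit an EOutOfMemory exception: a 32-bit process has only 2GB of user address space by default (up to 4GB when large-address-aware on 64-bit Windows) and cannot supply a contiguous 2GB block. In a 64-bit application the load succeeds, but it commits 2GB of private memory, and on a machine that is short of free RAM this can trigger heavy paging that slows the whole server. In this article, we'll explore how to optimize I/O performance for massive files using memory-mapped files.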

The Problem with Standard Streams

When you use TMemoryStream.LoadFromFile, the stream allocates one contiguous buffer the size of the file and reads the entire file into it before you can parse a single byte. For a 2GB file, this commits 2GB of memory regardless of how much of the file you actually need, and it takes significant time just to complete the disk read.

Even using TFileStream can be problematic if you are jumping around the file frequently (e.g., parsing the PDF XRef table at the end of the file, then jumping to objects scattered throughout the file). The continuous Seek and Read calls result in high kernel transition overhead.

The Solution: Memory-Mapped Files

Memory mapping (via the Windows API functions CreateFileMapping and MapViewOfFile) asks the OS to map the file directly into the application's virtual address space. You get a pointer to the data, and the Windows Virtual Memory Manager handles paging the data in and out of physical RAM strictly as you access it.

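Here is how you can implement a high-performance memory-mapped file reader in Delphi for PDF parsing. Note that it maps the whole file as a single view, so build it as a 64-bit application for gigabyte-scale files; a 32-bit process does not have enough contiguous address space for a view that large, and MapViewOfFile will fail: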

uses
  Winapi.Windows, System.SysUtils, System.Classes;

type
  TMemoryMappedFileReader = class
  private
    FFileHandle: THandle;
    FMappingHandle: THandle;
    FDataPtr: Pointer;
    FFileSize: Int64;
  public
    constructor Create(const FileName: string);
    destructor Destroy; override;
    property Data: Pointer read FDataPtr;
    property Size: Int64 read FFileSize;
  end;

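constructor TMemoryMappedFileReader.Create(const FileName: string);
var
  HighSize, LowSize: DWORD;
begin
  inherited Create;

  // Open the file read-only; other processes may still open it for reading
  FFileHandle := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if FFileHandle = INVALID_HANDLE_VALUE then
    RaiseLastOSError;

  // Get the 64-bit file size. INVALID_FILE_SIZE is also a legal low DWORD,
  // so it only signals failure when GetLastError reports an error.
  LowSize := GetFileSize(FFileHandle, @HighSize);
  if (LowSize = INVALID_FILE_SIZE) and (GetLastError <> NO_ERROR) then
    RaiseLastOSError;
  FFileSize := (Int64(HighSize) shl 32) or LowSize;

  // Create the read-only mapping object (this fails for a zero-length file)
  FMappingHandle := CreateFileMapping(FFileHandle, nil, PAGE_READONLY, HighSize, LowSize, nil);
  if FMappingHandle = 0 then
    RaiseLastOSError;

  // Map the whole file as one view: offset 0, length 0 means "to the end".
  // If any step raises, Delphi calls Destroy, which releases what was acquired.
  FDataPtr := MapViewOfFile(FMappingHandle, FILE_MAP_READ, 0, 0, 0);
  if FDataPtr = nil then
    RaiseLastOSError;
end;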

destructor TMemoryMappedFileReader.Destroy;
begin
  if FDataPtr <> nil then UnmapViewOfFile(FDataPtr);
  if FMappingHandle <> 0 then CloseHandle(FMappingHandle);
  if FFileHandle <> INVALID_HANDLE_VALUE then CloseHandle(FFileHandle);
  inherited;
end;
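
The reader above maps the entire file as one view. Where that much contiguous address space is not available (a 32-bit build, for example), a common alternative is to map only a window of the file at a time. The following is a minimal sketch of that idea, not part of the class above: the names AlignViewOffset and MapWindow are hypothetical, and the caller is assumed to guarantee that Offset + Count does not exceed the file size. The one hard rule it illustrates is that the file offset passed to MapViewOfFile must be a multiple of the system allocation granularity (reported by GetSystemInfo).

```delphi
uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: rounds Offset down to a multiple of Granularity.
// Delta receives the distance from the aligned offset to the requested one.
function AlignViewOffset(Offset: Int64; Granularity: DWORD; out Delta: DWORD): Int64;
begin
  Delta := DWORD(Offset mod Granularity);
  Result := Offset - Delta;
end;

// Hypothetical helper: maps Count bytes starting at file offset Offset.
// Returns a pointer to the requested byte. ViewBase receives the address
// that must later be passed to UnmapViewOfFile.
function MapWindow(MappingHandle: THandle; Offset: Int64; Count: NativeUInt;
  out ViewBase: Pointer): PByte;
var
  SysInfo: TSystemInfo;
  Aligned: Int64;
  Delta: DWORD;
begin
  GetSystemInfo(SysInfo);
  Aligned := AlignViewOffset(Offset, SysInfo.dwAllocationGranularity, Delta);

  // Map from the aligned offset and extend the length by Delta so the
  // view still covers the Count bytes the caller asked for.
  ViewBase := MapViewOfFile(MappingHandle, FILE_MAP_READ,
    DWORD(Aligned shr 32), DWORD(Aligned and $FFFFFFFF), Count + Delta);
  if ViewBase = nil then
    RaiseLastOSError;

  Result := PByte(ViewBase) + Delta;
end;
```

A parser built this way would map a window around each object it needs, read from the returned pointer, and call UnmapViewOfFile(ViewBase) before moving on. The mapping handle itself can stay open for the lifetime of the document.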

Why Memory Mapping Dominates PDF Parsing

PDF is a random-access format. The parser starts by reading the trailer at the end of the file, finds the XRef table, and then jumps randomly to byte offsets throughout the file to load specific dictionaries and streams.
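
As an illustration of that first step, here is a minimal sketch that locates the cross-reference offset in a mapped buffer. A PDF file ends with the keyword startxref, the byte offset of the XRef section, and the %%EOF marker, so the function scans backwards from the end of the data. The name FindStartXRef and the 1024-byte search window are assumptions made for this example, not part of any library.

```delphi
uses
  System.SysUtils;

// Hypothetical helper: returns the byte offset that follows the last
// 'startxref' keyword, or -1 if no keyword or number is found in the tail.
function FindStartXRef(Data: PByte; Size: NativeInt): Int64;
const
  Keyword: array[0..8] of AnsiChar = 'startxref';
  TailLength = 1024; // assumed search window at the end of the file
var
  Pos, Stop, I: NativeInt;
  Digits: Integer;
begin
  Result := -1;
  if (Data = nil) or (Size < Length(Keyword)) then
    Exit;

  Stop := Size - TailLength;
  if Stop < 0 then
    Stop := 0;

  // Scan backwards so the last occurrence in the file wins
  Pos := Size - Length(Keyword);
  while Pos >= Stop do
  begin
    if CompareMem(Data + Pos, @Keyword[0], Length(Keyword)) then
    begin
      I := Pos + Length(Keyword);
      // Skip PDF whitespace (NUL, TAB, LF, FF, CR, space)
      while (I < Size) and (Data[I] in [0, 9, 10, 12, 13, 32]) do
        Inc(I);
      // Parse the decimal offset
      Digits := 0;
      Result := 0;
      while (I < Size) and (Data[I] in [Ord('0')..Ord('9')]) do
      begin
        Result := Result * 10 + (Data[I] - Ord('0'));
        Inc(I);
        Inc(Digits);
      end;
      if Digits = 0 then
        Result := -1;
      Exit;
    end;
    Dec(Pos);
  end;
end;
```

With the reader class from this article, the call would look like FindStartXRef(PByte(Reader.Data), NativeInt(Reader.Size)). Only the pages holding the last kilobyte of the file are touched, so nothing else is read from disk at this point.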

With memory mapping:

  1. Zero-Copy: Data is not copied into a separate user-mode buffer on each read; the view exposes the same physical pages that the OS file cache uses.
  2. Instant Loading: Mapping a 2GB PDF typically takes milliseconds, because no file data is read from disk until you first touch a page through the pointer.
  3. OS Managed Paging: If you only parse 50MB of data out of the 2GB file, the OS loads roughly those 50MB (in page-sized units, plus any read-ahead) into physical RAM. The view still reserves 2GB of address space, but physical memory use stays small.

By implementing a custom stream class backed by memory-mapped files, your Delphi application can parse gigabyte-scale PDFs without loading them into memory first, so startup time and RAM usage depend on how much of the file you touch rather than on its total size.
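
One way to sketch such a stream class is to descend from TCustomMemoryStream, whose protected SetPointer method lets a stream expose memory it does not own. The class name TMappedViewStream is hypothetical, and the sketch assumes a recent Delphi version in which SetPointer accepts a NativeInt size (older versions limit it to a 32-bit value). The stream does not unmap anything; the object that owns the view must outlive it.

```delphi
uses
  System.SysUtils, System.Classes;

type
  // Hypothetical read-only stream over memory owned elsewhere,
  // for example a view returned by MapViewOfFile.
  TMappedViewStream = class(TCustomMemoryStream)
  public
    constructor Create(AData: Pointer; ASize: NativeInt);
    function Write(const Buffer; Count: Longint): Longint; override;
  end;

constructor TMappedViewStream.Create(AData: Pointer; ASize: NativeInt);
begin
  inherited Create;
  // Point the inherited Read/Seek logic at the existing memory block
  SetPointer(AData, ASize);
end;

function TMappedViewStream.Write(const Buffer; Count: Longint): Longint;
begin
  // The view is mapped with FILE_MAP_READ, so writing is not supported
  raise EStreamError.Create('TMappedViewStream is read-only');
end;
```

Existing parsing code that expects a TStream can then work on the mapped data unchanged, for example by constructing TMappedViewStream.Create(Reader.Data, NativeInt(Reader.Size)) from the reader shown earlier and setting Position to the offsets taken from the XRef table.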

Note: Optimized I/O stream handling for massive documents is built directly into the HotPDF VCL Component.