Technical Article

Debugging Range Check Errors in Delphi PDF Libraries

When working with PDF manipulation libraries in Delphi, range check errors are frustrating because they often occur deep within complex document structures and may appear only intermittently, depending on the structure of the PDF being processed. That makes them hard to reproduce and debug consistently. This article follows the debugging of a range check error in a PDF page copying utility, showing a systematic approach to identifying, analysing, and fixing such issues while also improving the overall software architecture.

The Initial Problem: A Deceptively Simple Command

The issue first manifested when running what appeared to be a straightforward command to copy pages from a PDF document:

CopyPage.exe input.pdf -page 1-3

This command, designed to extract pages 1 through 3 from a PDF file, would trigger a range check error at line 14783 in the HPDFDoc.pas file, specifically within the CopyPageFromDocument method. The error was particularly puzzling because it didn’t occur with all PDF files—only certain documents with specific internal structures would trigger the failure.

The intermittent nature of the bug suggested that the issue was related to boundary conditions or edge cases in the PDF processing logic. This is a common pattern in PDF manipulation software, where the vast diversity of PDF generation tools and document structures can expose subtle bugs that only manifest under specific conditions.

Understanding Range Check Errors in Delphi

Before diving into the specific debugging process, it’s important to understand what range check errors represent in Delphi applications. Range checking is a runtime safety feature, controlled by the {$R+} (or {$RANGECHECKS ON}) compiler directive, that validates array and string indices and assignments to ordinal and subrange types. When it is enabled (typically in debug builds), Delphi raises an ERangeError exception if code attempts to access an array element outside its bounds.

Range check errors are particularly valuable during development because they catch potential buffer overruns and memory corruption issues that could lead to unpredictable behaviour or security vulnerabilities in production code. However, they can also be frustrating when they occur in complex, deeply nested code structures where the root cause isn’t immediately obvious.
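
A minimal sketch of range checking in action, assuming a console program built with Delphi (or Free Pascal in Delphi mode). The names TriggersRangeError and Pages are illustrative only and are not part of the library discussed in this article.

```pascal
program RangeCheckDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}
{$R+} // Enable range checking for this program

uses
  SysUtils;

// Illustrative helper: returns True if writing to Pages[Index] raises
// ERangeError, i.e. the index is out of bounds and {$R+} is active.
function TriggersRangeError(Index: Integer): Boolean;
var
  Pages: array of Integer;
begin
  Result := False;
  SetLength(Pages, 3); // Valid indices are 0..2
  try
    Pages[Index] := 42;
  except
    on E: ERangeError do
      Result := True;
  end;
end;

begin
  WriteLn('Index 2 raises ERangeError: ', TriggersRangeError(2));
  WriteLn('Index 3 raises ERangeError: ', TriggersRangeError(3));
  WriteLn('Index -1 raises ERangeError: ', TriggersRangeError(-1));
end.
```

With {$R-}, the same out-of-bounds writes go unchecked and can silently corrupt memory, which is why the check is worth keeping enabled in debug builds.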

Systematic Debugging Approach

Step 1: Reproducing and Isolating the Problem

The first step in any systematic debugging process is to create a reliable reproduction case. In this instance, the error occurred with specific PDF files but not others, which immediately suggested that the issue was related to document structure rather than general algorithmic problems.

Using a debugger, we traced the execution path to identify exactly where the bounds violation occurred. The error pointed to array access without proper bounds checking in the page object management code:

// Problematic code - the guard checks only the upper bound
if FDocStarted and (DestIndex < Length(PageArr)) and (PageArr[DestIndex].PageObj <> nil) then
begin
  // A negative DestIndex passes the (DestIndex < Length(PageArr)) test,
  // so PageArr[DestIndex] can still be evaluated out of bounds.
  // The conditional logic doesn't protect against all edge cases.
  Result := PageArr[DestIndex].PageObj;
end;

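The issue became clearer upon closer examination of the conditional logic. The code did include a bounds check (DestIndex < Length(PageArr)), but only for the upper bound: a negative DestIndex passes that test and reaches PageArr[DestIndex]. The guard also relies on Delphi’s default short-circuit evaluation. If the unit is compiled with complete boolean evaluation ({$B+}), every operand of the compound condition is evaluated, including the array access, even when an earlier operand is already False.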

Step 2: Analysing the Root Cause

The root cause analysis revealed several interconnected issues:

Incomplete Bounds Check: The primary issue was in the guard condition itself. It tested FDocStarted and the upper bound, but never checked that DestIndex was non-negative, so a negative index fell through to the array access. Any code path that used DestIndex outside this condition, for example when FDocStarted was False, received no protection from the check at all.

Complex Boolean Expressions: The compound boolean expression made it difficult to reason about all possible execution paths. Complex conditions like this are prone to logical errors, especially when modified during maintenance.

Implicit Assumptions: The code made implicit assumptions about the relationship between FDocStarted and the validity of DestIndex. These assumptions weren’t always valid, particularly when processing PDFs with unusual structures.
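
The failure modes described above can be reproduced in isolation. The following is a sketch under stated assumptions: TPageEntry and PageArr are simplified stand-ins for the library’s structures, and both function names are hypothetical. It compares the upper-bound-only guard under short-circuit evaluation ({$B-}) and under complete evaluation ({$B+}).

```pascal
program GuardDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}
{$R+} // Range checking on, so out-of-bounds access raises ERangeError

uses
  SysUtils;

type
  // Simplified stand-in for the library's page record
  TPageEntry = record
    PageObj: TObject;
  end;

var
  PageArr: array of TPageEntry;

{$B-} // Short-circuit evaluation (the Delphi default)
function GuardRaisesShortCircuit(DestIndex: Integer): Boolean;
begin
  Result := False;
  try
    // Upper-bound-only guard, as in the problematic snippet
    if (DestIndex < Length(PageArr)) and (PageArr[DestIndex].PageObj <> nil) then
      WriteLn('Found a page object');
  except
    on E: ERangeError do
      Result := True;
  end;
end;

{$B+} // Complete evaluation: both operands are always evaluated
function GuardRaisesCompleteEval(DestIndex: Integer): Boolean;
begin
  Result := False;
  try
    if (DestIndex < Length(PageArr)) and (PageArr[DestIndex].PageObj <> nil) then
      WriteLn('Found a page object');
  except
    on E: ERangeError do
      Result := True;
  end;
end;
{$B-}

begin
  SetLength(PageArr, 3); // Valid indices are 0..2, all PageObj fields nil
  WriteLn('Short-circuit, index -1 raises: ', GuardRaisesShortCircuit(-1));
  WriteLn('Short-circuit, index  3 raises: ', GuardRaisesShortCircuit(3));
  WriteLn('Complete eval, index  3 raises: ', GuardRaisesCompleteEval(3));
end.
```

Under short-circuit evaluation only the negative index should slip past the guard. Under complete evaluation the too-large index should fail as well, because the array access is evaluated regardless of the comparison before it.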

Step 3: Implementing the Immediate Fix

The immediate fix focused on ensuring that bounds checking always occurred before array access, regardless of other conditions:

// Fixed code - bounds check first and foremost
if (DestIndex >= 0) and (DestIndex < Length(PageArr)) then
begin
  if FDocStarted and (PageArr[DestIndex].PageObj <> nil) then
  begin
    Result := PageArr[DestIndex].PageObj;
  end
  else
  begin
    // Handle the case where document isn't started or page object is nil
    Result := nil;
  end;
end
else
begin
  // Handle invalid index gracefully
  raise Exception.CreateFmt('Invalid page index: %d (valid range: 0-%d)',
                           [DestIndex, Length(PageArr) - 1]);
end;

This fix not only addressed the immediate range check error but also improved error handling by providing meaningful error messages when invalid indices are encountered.

Extending Functionality During Debugging

One of the valuable aspects of thorough debugging is that it often reveals opportunities for improvement beyond the immediate bug fix. While investigating the range check error, the user requested additional functionality: the ability to copy all pages from a document without explicitly specifying page ranges.

The requested enhancement was to make this command work:

CopyPage.exe input.pdf

This seemingly simple request required careful consideration of the command-line parsing logic and output file naming conventions. The implementation needed to handle several scenarios:

Automatic Output Filename Generation

// Enhanced command-line processing with auto-generation
procedure ProcessCommandLine;
var
  InputBaseName, InputExt, OutputFile: string;
begin
  // Parse existing command-line arguments
  ParseArguments;

  // Validate the input file first: the output name is derived from it
  if InputFile = '' then
  begin
    ShowUsage;
    Halt(1);
  end;

  // If no output files specified, generate automatic filename
  if Length(OutputFiles) = 0 then
  begin
    InputBaseName := ChangeFileExt(ExtractFileName(InputFile), '');
    InputExt := ExtractFileExt(InputFile);

    // Generate descriptive output filename. ExtractFileName drops the
    // input path, so the file is written to the current directory.
    OutputFile := InputBaseName + '-PageAll' + InputExt;
    SetLength(OutputFiles, 1);
    OutputFiles[0] := OutputFile;

    // Log the auto-generated filename for user feedback
    WriteLn('Auto-generated output file: ', OutputFile);
  end;
end;
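
To make the naming rule concrete, here is a small sketch that applies the same SysUtils calls to a few sample names. AutoOutputName is a hypothetical helper introduced only for this illustration.

```pascal
program OutputNameDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper mirroring the naming rule: <base>-PageAll<ext>
function AutoOutputName(const InputFile: string): string;
begin
  Result := ChangeFileExt(ExtractFileName(InputFile), '')
    + '-PageAll' + ExtractFileExt(InputFile);
end;

begin
  WriteLn(AutoOutputName('input.pdf'));     // input-PageAll.pdf
  WriteLn(AutoOutputName('report.v2.pdf')); // report.v2-PageAll.pdf
  WriteLn(AutoOutputName('scan'));          // scan-PageAll (no extension)
end.
```

Only the last extension is split off, so dotted base names such as report.v2 survive intact.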

Page Range Processing Logic

The page processing logic also needed enhancement to handle the “copy all pages” scenario efficiently:

// Enhanced page range processing
procedure DeterminePagesToCopy;
var
  i: Integer;
begin
  if PageRangeSpecified then
  begin
    // Use explicitly specified page ranges
    ParsePageRanges(PageRangeString, PageIndices);
    SetLength(PagesToCopy, Length(PageIndices));
    for i := 0 to High(PageIndices) do
      PagesToCopy[i] := PageIndices[i];
  end
  else
  begin
    // Copy all pages in document order
    SetLength(PagesToCopy, TotalPages);
    for i := 0 to TotalPages - 1 do
      PagesToCopy[i] := i;
    
    WriteLn(Format('Copying all %d pages from document', [TotalPages]));
  end;
end;
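
The article does not show ParsePageRanges itself. As a rough sketch of what such a routine has to do, the hypothetical ParseSimplePageRange below converts a single specification such as '2' or '1-3' into zero-based indices and rejects anything outside the document. It assumes that page numbers on the command line are 1-based while PagesToCopy holds 0-based indices. It handles one range only, not comma-separated lists.

```pascal
program PageRangeDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TPageIndexArray = array of Integer;

// Hypothetical helper: '2' or '1-3' (1-based) -> 0-based page indices.
// Raises an exception for malformed or out-of-range input.
function ParseSimplePageRange(const Spec: string;
  TotalPages: Integer): TPageIndexArray;
var
  DashPos, FirstPage, LastPage, i: Integer;
begin
  DashPos := Pos('-', Spec);
  if DashPos = 0 then
  begin
    FirstPage := StrToInt(Trim(Spec));
    LastPage := FirstPage;
  end
  else
  begin
    FirstPage := StrToInt(Trim(Copy(Spec, 1, DashPos - 1)));
    LastPage := StrToInt(Trim(Copy(Spec, DashPos + 1, MaxInt)));
  end;

  // Validate before any array is sized or indexed
  if (FirstPage < 1) or (LastPage < FirstPage) or (LastPage > TotalPages) then
    raise Exception.CreateFmt('Page range "%s" is outside 1-%d',
      [Spec, TotalPages]);

  SetLength(Result, LastPage - FirstPage + 1);
  for i := 0 to High(Result) do
    Result[i] := FirstPage - 1 + i;
end;

var
  Indices: TPageIndexArray;
  i: Integer;
begin
  Indices := ParseSimplePageRange('1-3', 5);
  for i := 0 to High(Indices) do
    WriteLn('PagesToCopy[', i, '] = ', Indices[i]);

  try
    ParseSimplePageRange('4-9', 5);
  except
    on E: Exception do
      WriteLn('Rejected: ', E.Message);
  end;
end.
```

Validating the range at parse time means later code that indexes the page array never sees an out-of-range value in the first place.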

Uncovering Deeper Architectural Issues

As the debugging process continued, it revealed more fundamental problems in the codebase that went beyond the immediate range check error. These discoveries highlight why thorough debugging often leads to significant architectural improvements.

Hard-Coded Page Mapping Logic

The investigation uncovered problematic hard-coded page mapping logic that was attempting to compensate for perceived PDF structure issues:

// Problematic hard-coded mapping discovered during debugging
procedure ApplyPageMapping;
var
  i: Integer;
begin
  if TotalPages = 3 then
  begin
    // Special case handling for 3-page documents
    // This was an attempt to fix page ordering issues
    PagesToCopy[0] := 1; // Display page 2 first
    PagesToCopy[1] := 2; // Display page 3 second  
    PagesToCopy[2] := 0; // Display page 1 last
    WriteLn('Applied 3-page document mapping');
  end
  else if TotalPages > 3 then
  begin
    // Generic swapping logic for larger documents
    PagesToCopy[0] := TotalPages - 1; // Last page first
    PagesToCopy[TotalPages - 1] := 0; // First page last
    
    // Keep middle pages in order
    for i := 1 to TotalPages - 2 do
      PagesToCopy[i] := i;
      
    WriteLn('Applied generic page reordering');
  end;
end;

This hard-coded logic was clearly a workaround for deeper issues with PDF page ordering. Such heuristic-based solutions are fragile and fail when encountering PDFs with different internal structures than those used during development.

The Dangers of Heuristic Programming

Heuristic-based solutions like the page mapping code above represent a common anti-pattern in software development. They typically arise when developers encounter unexpected behaviour and implement quick fixes based on observed patterns rather than understanding the underlying root cause.

The problems with heuristic solutions include:

  • Brittleness: They work only for the specific cases observed during development
  • Maintenance burden: Each new edge case requires additional heuristic rules
  • Unpredictability: Users can’t understand why their documents behave differently
  • Technical debt: The code becomes increasingly complex and difficult to maintain

The Importance of PDF Structure Understanding

The debugging process ultimately led to a deeper investigation of PDF internal structure, which revealed why the hard-coded mappings existed in the first place. This investigation highlights the importance of understanding the data formats your software processes.

PDF Object Storage vs. Display Order

PDF documents store pages as objects that can appear in any order within the file. The actual page sequence is determined by the Pages tree structure, not by object storage order:

% Example PDF structure showing object vs. display order mismatch
% (content streams and other required entries omitted for brevity)
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj

2 0 obj
<< /Type /Pages /Kids [20 0 R 3 0 R 4 0 R] /Count 3 >>
endobj

% Note: Pages appear in Kids array order [20, 3, 4]
% But objects are stored in file order [1, 2, 3, 4, 20]
% Display order: Page 1 = Object 20, Page 2 = Object 3, Page 3 = Object 4

3 0 obj
<< /Type /Page /Contents 6 0 R /Parent 2 0 R >>
endobj

4 0 obj
<< /Type /Page /Contents 5 0 R /Parent 2 0 R >>
endobj

20 0 obj
<< /Type /Page /Contents 21 0 R /Parent 2 0 R >>
endobj

This structure explains why naive approaches to page processing (such as processing objects in file order) produce incorrect results.

Implementing Proper PDF Page Tree Traversal

The correct solution required implementing proper PDF page tree traversal:

// Proper PDF page tree traversal implementation
function GetCorrectPageOrderFromPagesTree(Doc: TPDFDocument): Integer;
var
  CatalogObj, PagesObj: TPDFObject;
  KidsArray: TPDFArray;
  i: Integer;
  PageObj: TPDFObject;
begin
  Result := 0;
  
  try
    // Step 1: Find the document catalog (root object)
    CatalogObj := Doc.FindRootObject;
    if CatalogObj = nil then
    begin
      WriteLn('Warning: Could not find document catalog');
      Exit;
    end;
    
    // Step 2: Get the Pages object from catalog
    PagesObj := CatalogObj.GetIndirectObject('/Pages');
    if PagesObj = nil then
    begin
      WriteLn('Warning: Could not find Pages object in catalog');
      Exit;
    end;
    
    // Step 3: Extract the Kids array (page references)
    KidsArray := PagesObj.GetArray('/Kids');
    if KidsArray = nil then
    begin
      WriteLn('Warning: Could not find Kids array in Pages object');
      Exit;
    end;
    
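    // Step 4: Process pages in Kids array order.
    // Note: this handles a flat Kids array only. Kids may also reference
    // intermediate /Pages nodes, which would need a recursive descent.
    SetLength(Doc.PageArr, KidsArray.Count);
    for i := 0 to KidsArray.Count - 1 do
    begin
      PageObj := KidsArray.GetIndirectObject(i);
      if PageObj <> nil then
      begin
        // Write at Result, not i, so unresolved references leave no nil gaps
        Doc.PageArr[Result].PageObj := PageObj;
        Doc.PageArr[Result].PageIndex := Result;
        Inc(Result);
      end;
    end;
    SetLength(Doc.PageArr, Result); // Trim slots left by unresolved references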
    
    WriteLn(Format('Successfully ordered %d pages from PDF structure', [Result]));
    
  except
    on E: Exception do
    begin
      WriteLn('Error during page tree traversal: ', E.Message);
      Result := 0;
    end;
  end;
end;
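
A page tree is not necessarily flat: a Kids array may reference intermediate Pages nodes as well as Page leaves. The sketch below shows one way to walk such a tree depth-first. It uses a hypothetical in-memory node class, TPageTreeNode, rather than the library’s TPDFObject API, and the depth limit is an arbitrary safeguard chosen for illustration.

```pascal
program PageTreeDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

const
  MaxTreeDepth = 64; // Arbitrary guard against malformed, very deep trees

type
  TNodeKind = (nkPages, nkPage);
  TObjNumArray = array of Integer;

  // Hypothetical stand-in for a parsed page-tree node
  TPageTreeNode = class
  public
    Kind: TNodeKind;
    ObjNum: Integer;
    Kids: array of TPageTreeNode;
    constructor Create(AKind: TNodeKind; AObjNum: Integer);
    destructor Destroy; override;
    procedure AddKid(Kid: TPageTreeNode);
  end;

constructor TPageTreeNode.Create(AKind: TNodeKind; AObjNum: Integer);
begin
  inherited Create;
  Kind := AKind;
  ObjNum := AObjNum;
end;

destructor TPageTreeNode.Destroy;
var
  i: Integer;
begin
  for i := 0 to High(Kids) do
    Kids[i].Free; // Each node owns its children
  inherited Destroy;
end;

procedure TPageTreeNode.AddKid(Kid: TPageTreeNode);
begin
  SetLength(Kids, Length(Kids) + 1);
  Kids[High(Kids)] := Kid;
end;

// Depth-first, left-to-right walk: leaves are appended in display order
procedure CollectPages(Node: TPageTreeNode; Depth: Integer;
  var Pages: TObjNumArray);
var
  i: Integer;
begin
  if (Node = nil) or (Depth > MaxTreeDepth) then
    Exit;
  if Node.Kind = nkPage then
  begin
    SetLength(Pages, Length(Pages) + 1);
    Pages[High(Pages)] := Node.ObjNum;
  end
  else
    for i := 0 to High(Node.Kids) do
      CollectPages(Node.Kids[i], Depth + 1, Pages);
end;

var
  Root, Inner: TPageTreeNode;
  Pages: TObjNumArray;
  i: Integer;
begin
  // Root /Pages node with one leaf and one nested /Pages node
  Root := TPageTreeNode.Create(nkPages, 2);
  try
    Root.AddKid(TPageTreeNode.Create(nkPage, 20));
    Inner := TPageTreeNode.Create(nkPages, 7);
    Inner.AddKid(TPageTreeNode.Create(nkPage, 3));
    Inner.AddKid(TPageTreeNode.Create(nkPage, 4));
    Root.AddKid(Inner);

    CollectPages(Root, 0, Pages);
    for i := 0 to High(Pages) do
      WriteLn('Page ', i + 1, ' = object ', Pages[i]);
  finally
    Root.Free;
  end;
end.
```

For this tree the walk yields objects 20, 3, and 4 in that order, regardless of how the objects are numbered or where they sit in the file.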

Implementing Robust Fallback Mechanisms

Real-world PDF files often have structural anomalies or non-standard implementations. A robust PDF processing library must handle these edge cases gracefully:

// Robust PDF page detection with multiple fallback strategies
function ReorderPageArrByPagesTree(Doc: TPDFDocument): Boolean;
var
  i: Integer;
  Obj: TPDFObject;
  KidsArray: TPDFArray;
begin
  Result := False;
  
  // Primary method: Standard PDF structure traversal
  if TryStandardPageTreeTraversal(Doc) then
  begin
    Result := True;
    WriteLn('Used standard PDF page tree traversal');
    Exit;
  end;
  
  // Fallback 1: Search for any object with Kids array
  WriteLn('Standard traversal failed, trying fallback method...');
  for i := 0 to Doc.Objects.Count - 1 do
  begin
    Obj := Doc.Objects[i];
    if (Obj <> nil) and Obj.HasKey('/Kids') then
    begin
      KidsArray := Obj.GetArray('/Kids');
      if (KidsArray <> nil) and (KidsArray.Count > 0) then
      begin
        if ProcessKidsArray(Doc, KidsArray) then
        begin
          Result := True;
          WriteLn('Successfully used fallback Kids array processing');
          Exit;
        end;
      end;
    end;
  end;
  
  // Fallback 2: Sequential page object discovery
  if not Result then
  begin
    WriteLn('All structured methods failed, using sequential discovery...');
    Result := DiscoverPagesSequentially(Doc);
  end;
  
  if not Result then
    WriteLn('Warning: All page discovery methods failed');
end;
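
The same try-in-order idea can also be written as a table of strategies, which keeps the control flow flat as fallbacks are added. This is only a sketch: the strategy functions are stubs with hard-coded results, and TPageStrategy, TNamedStrategy, and DiscoverPages are hypothetical names, not part of the library.

```pascal
program FallbackChainDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical strategy signature: returns True when pages were found
  TPageStrategy = function(const FileName: string): Boolean;

  TNamedStrategy = record
    Name: string;
    Run: TPageStrategy;
  end;

function StandardTraversal(const FileName: string): Boolean;
begin
  Result := False; // Stub: pretend the catalog lookup failed
end;

function KidsArrayScan(const FileName: string): Boolean;
begin
  Result := True; // Stub: pretend the scan found pages
end;

function SequentialDiscovery(const FileName: string): Boolean;
begin
  Result := True; // Stub: last resort
end;

// Returns the name of the first strategy that succeeds, or '' if none does
function DiscoverPages(const FileName: string;
  const Strategies: array of TNamedStrategy): string;
var
  i: Integer;
begin
  Result := '';
  for i := Low(Strategies) to High(Strategies) do
    if Strategies[i].Run(FileName) then
    begin
      Result := Strategies[i].Name;
      Exit;
    end;
end;

var
  Chain: array[0..2] of TNamedStrategy;
begin
  Chain[0].Name := 'standard traversal';
  Chain[0].Run := StandardTraversal;
  Chain[1].Name := 'Kids array scan';
  Chain[1].Run := KidsArrayScan;
  Chain[2].Name := 'sequential discovery';
  Chain[2].Run := SequentialDiscovery;

  WriteLn('Succeeded with: ', DiscoverPages('input.pdf', Chain));
end.
```

Because the order of the table is the order of preference, the most specification-faithful method stays first and the heuristics stay last.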

Testing and Validation Strategies

Comprehensive testing is crucial when dealing with PDF processing bugs, especially those that only manifest with specific document structures.

Creating Diverse Test Cases

# Test case generation for PDF page ordering
# Test 1: Standard sequential PDF
pdftk A=page1.pdf B=page2.pdf C=page3.pdf cat A B C output sequential.pdf

# Test 2: Same source pages concatenated in a different order
pdftk A=page3.pdf B=page1.pdf C=page2.pdf cat A B C output non-sequential.pdf

# Test 3: Large document with out-of-order page ranges
pdftk A=large-doc.pdf cat A50-52 A25-27 A1-3 output mixed-ranges.pdf

# Test 4: Single page document
pdftk A=multi-page.pdf cat A1 output single-page.pdf

Automated Testing Framework

// Automated testing for PDF page ordering

// Declared before RunPageOrderingTests so that it is visible to the caller
function ValidatePageOrdering(const FileName: string): Boolean;
var
  Doc: TPDFDocument;
  ExpectedOrder, ActualOrder: TIntegerArray;
begin
  Result := False;
  Doc := TPDFDocument.Create;
  try
    if Doc.LoadFromFile(FileName) then
    begin
      ExpectedOrder := GetExpectedPageOrder(FileName);
      ActualOrder := GetActualPageOrder(Doc);
      Result := ComparePageOrders(ExpectedOrder, ActualOrder);
    end;
  finally
    Doc.Free;
  end;
end;

procedure RunPageOrderingTests;
var
  TestFiles: array of string;
  i: Integer;
  TestResult: Boolean;
begin
  TestFiles := ['sequential.pdf', 'non-sequential.pdf', 'mixed-ranges.pdf', 'single-page.pdf'];

  WriteLn('Running PDF page ordering tests...');
  for i := 0 to High(TestFiles) do
  begin
    Write(Format('Testing %s... ', [TestFiles[i]]));
    TestResult := ValidatePageOrdering(TestFiles[i]);
    if TestResult then
      WriteLn('PASS')
    else
      WriteLn('FAIL');
  end;
end;
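
The comparison helper is not shown in the article. One possible shape for it, sketched here under the hypothetical name FirstOrderMismatch, reports where two page orders first diverge instead of returning a bare Boolean, which makes a failing test easier to diagnose.

```pascal
program CompareOrderDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: returns -1 when both orders are identical,
// otherwise the index of the first position where they differ
// (for arrays of unequal length, the length of the shorter one).
function FirstOrderMismatch(const Expected, Actual: array of Integer): Integer;
var
  i, Common: Integer;
begin
  Common := Length(Expected);
  if Length(Actual) < Common then
    Common := Length(Actual);

  for i := 0 to Common - 1 do
    if Expected[i] <> Actual[i] then
    begin
      Result := i;
      Exit;
    end;

  if Length(Expected) <> Length(Actual) then
    Result := Common
  else
    Result := -1;
end;

begin
  WriteLn(FirstOrderMismatch([0, 1, 2], [0, 1, 2])); // -1: identical
  WriteLn(FirstOrderMismatch([0, 1, 2], [2, 0, 1])); // 0: differs at once
  WriteLn(FirstOrderMismatch([0, 1, 2], [0, 1]));    // 2: a page is missing
end.
```

A Boolean result for the test runner is then simply FirstOrderMismatch(...) = -1.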

Performance Considerations and Optimisation

While fixing the range check error and implementing proper PDF structure handling, it’s important to consider performance implications:

Memory Management

// Efficient memory management for large PDF processing
procedure ProcessLargePDF(const FileName: string);
var
  Doc: TPDFDocument;
  PageCache: TPageCache;
  i: Integer;
begin
  Doc := TPDFDocument.Create;
  try
    PageCache := TPageCache.Create(100); // Cache up to 100 pages
    try
      if not Doc.LoadFromFile(FileName) then
        raise Exception.CreateFmt('Could not load PDF file: %s', [FileName]);

      for i := 0 to Doc.PageCount - 1 do
      begin
        ProcessSinglePage(Doc, i, PageCache);

        // Delphi has no garbage collector, so memory is released explicitly:
        // evict old cache entries after every 50 pages
        if (i + 1) mod 50 = 0 then
          PageCache.ClearOldEntries;
      end;
    finally
      PageCache.Free;
    end;
  finally
    // Nested try..finally: Doc is freed even if creating the cache fails
    Doc.Free;
  end;
end;

Lessons Learned and Best Practices

1. Always Prioritise Bounds Checking

When dealing with array access, always check both the lower and the upper bound before the access, as the first conditions in any compound boolean expression. Consider using helper functions to encapsulate safe array access patterns.
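
One way such a helper might look is sketched below. TryGetPage and the simplified TPageEntry record are assumptions made for this example, not the library’s actual types. The helper follows the familiar Try... convention of returning a Boolean and delivering the value through an out parameter.

```pascal
program SafeAccessDemo;
{$IFDEF FPC}{$mode delphi}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Simplified stand-in for the library's page record
  TPageEntry = record
    PageObj: TObject;
    PageIndex: Integer;
  end;
  TPageArray = array of TPageEntry;

// Hypothetical helper: checks both bounds in one place, so callers
// cannot forget the lower bound or reorder the conditions.
function TryGetPage(const Pages: TPageArray; Index: Integer;
  out Entry: TPageEntry): Boolean;
begin
  Result := (Index >= 0) and (Index < Length(Pages));
  if Result then
    Entry := Pages[Index]
  else
  begin
    Entry.PageObj := nil;
    Entry.PageIndex := -1;
  end;
end;

var
  Pages: TPageArray;
  Entry: TPageEntry;
  i: Integer;
begin
  SetLength(Pages, 3);
  for i := 0 to High(Pages) do
    Pages[i].PageIndex := i;

  for i := -1 to 3 do
    if TryGetPage(Pages, i, Entry) then
      WriteLn('Index ', i, ': found page ', Entry.PageIndex)
    else
      WriteLn('Index ', i, ': out of range');
end.
```

Call sites then read as a single test, with no array indexing left in the condition to get wrong.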

2. Understand Your Data Format

Invest time in thoroughly understanding the specifications of complex data formats like PDF. This understanding prevents the need for heuristic workarounds and leads to more robust solutions.

3. Avoid Hard-Coded Logic

Hard-coded mappings and heuristic solutions should be replaced with structure-aware algorithms that follow the format specifications.

4. Implement Comprehensive Error Handling

Provide meaningful error messages and graceful degradation when encountering unexpected conditions.

5. Test with Diverse Inputs

Range check errors and structural issues often depend on specific data patterns. Create comprehensive test suites that cover various document structures and edge cases.

6. Document Your Assumptions

Clearly document any assumptions your code makes about data structure or format compliance. This helps future maintainers understand the reasoning behind implementation decisions.

Conclusion

Debugging range check errors in PDF libraries requires a systematic approach that combines careful code analysis, deep understanding of the PDF format, and comprehensive testing strategies. This case study demonstrates that thorough debugging often reveals opportunities for significant architectural improvements beyond the immediate bug fix.

The key takeaways from this debugging journey include the importance of understanding data format specifications, avoiding heuristic solutions in favour of specification-compliant implementations, and building robust error handling and fallback mechanisms. By following these principles, developers can create more reliable PDF processing applications that handle diverse document structures correctly.

Most importantly, this case study illustrates that debugging is not just about fixing immediate problems—it’s an opportunity to improve software architecture, enhance functionality, and build more maintainable code. The investment in thorough debugging and proper implementation pays dividends in reduced support burden, improved user satisfaction, and easier future maintenance.