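The Hidden Complexity Behind PDF Structure
PDF documents are far more complex than what users see. Although viewers display pages in a logical, sequential arrangement (1, 2, 3, ...), the internal architecture of a PDF file tells a very different story. This complexity is one of the most commonly misunderstood aspects of PDF processing, and it is a frequent source of bugs, incorrect implementations, and frustrated developers. This article examines how PDF pages are organized, explains why developers so often run into unexpected page-ordering problems, and offers practical solutions for reliable PDF processing.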
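The PDF Object Model: From Sequential Documents to a New Paradigm
To understand the challenge of PDF page ordering, we first have to recognize how fundamentally PDF differs from simpler document formats. Unlike plain text files, HTML documents, or even older formats such as RTF, PDF uses an object-based architecture in which content organization and physical storage are completely separate.
There are several reasons for this architectural decision:
- Flexibility: objects can be referenced from multiple places without being duplicated.
- Efficiency: common resources (fonts, images, graphics states) can be shared between pages.
- Incremental updates: a document can be modified without rewriting the entire file.
- Random access: viewers can jump to any page without parsing the entire document.
This flexibility comes at the cost of complexity, however, especially when it comes to understanding the relationship between the order in which objects are stored and the logical order of the pages.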
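Object References vs. Display Order: A Concrete Example
Consider the following typical PDF structure, which illustrates the disconnect between storage and display:

    %PDF-1.4
    % PDF file structure example - storage order vs. display order
    % (content streams 5, 11, 21 and font 6 are omitted for brevity)
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Kids [20 0 R 10 0 R 4 0 R] /Count 3 >>
    endobj

    % Object 4 appears third in file but represents page 3 in display
    4 0 obj
    <<
      /Type /Page
      /Contents 5 0 R
      /Parent 2 0 R
      /MediaBox [0 0 612 792]
      /Resources << /Font << /F1 6 0 R >> >>
    >>
    endobj

    % Object 10 appears fourth in file but represents page 2 in display
    10 0 obj
    <<
      /Type /Page
      /Contents 11 0 R
      /Parent 2 0 R
      /MediaBox [0 0 612 792]
      /Resources << /Font << /F1 6 0 R >> >>
    >>
    endobj

    % Object 20 appears last in file but represents page 1 in display
    20 0 obj
    <<
      /Type /Page
      /Contents 21 0 R
      /Parent 2 0 R
      /MediaBox [0 0 612 792]
      /Resources << /Font << /F1 6 0 R >> >>
    >>
    endobj

In this example the page objects are stored as objects 4, 10 and 20, but the display order is defined by the Kids array: [20, 10, 4]. This produces the following mapping:
- Page 1 (display order) = object 20 (storage order: last).
- Page 2 (display order) = object 10 (storage order: fourth).
- Page 3 (display order) = object 4 (storage order: third).
This mismatch is not accidental. It is a fundamental property of PDF, and it is what makes advanced document processing and optimization possible.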
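To make the mapping concrete, here is a minimal Free Pascal sketch. It assumes a three-page document whose page objects are numbered 4, 10 and 20 and whose /Kids array lists them as [20, 10, 4]. The constants and the DisplayPageNumber helper are illustrative names, not part of any PDF library.

```pascal
program StorageVsDisplayOrder;
{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Assumed page object numbers in ascending (storage) order.
  StoredPageObjects: array[0..2] of Integer = (4, 10, 20);
  // Assumed contents of /Kids, i.e. the display order.
  KidsOrder: array[0..2] of Integer = (20, 10, 4);

// Returns the 1-based display page number of an object,
// or 0 if the object is not listed in /Kids.
function DisplayPageNumber(ObjNum: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(KidsOrder) to High(KidsOrder) do
    if KidsOrder[i] = ObjNum then
    begin
      Result := i + 1;
      Exit;
    end;
end;

var
  i: Integer;
begin
  // Walk the objects in storage order and report where each one is displayed.
  for i := Low(StoredPageObjects) to High(StoredPageObjects) do
    WriteLn(Format('object %d -> page %d',
      [StoredPageObjects[i], DisplayPageNumber(StoredPageObjects[i])]));
end.
```

The point of the sketch is that the page number is the position in /Kids, never a function of the object number itself.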
Why PDF Generators Create Non-Sequential Object Orders
Knowing why PDF generators produce non-sequential object orders helps developers appreciate the complexity they are dealing with and avoid wrong assumptions about document structure.
PDF Creation Workflows
Different PDF creation workflows lead to different object-ordering patterns:
1. Sequential document creation

    % Typical output from simple PDF generators
    1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
    2 0 obj << /Type /Pages /Kids [3 0 R 4 0 R 5 0 R] /Count 3 >> endobj
    3 0 obj << /Type /Page /Contents 6 0 R /Parent 2 0 R >> endobj
    4 0 obj << /Type /Page /Contents 7 0 R /Parent 2 0 R >> endobj
    5 0 obj << /Type /Page /Contents 8 0 R /Parent 2 0 R >> endobj

2. Optimized resource sharing

    % PDF with shared resources created first
    1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
    2 0 obj << /Type /Pages /Kids [10 0 R 11 0 R 12 0 R] /Count 3 >> endobj
    3 0 obj << /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> endobj
    4 0 obj << /Type /XObject /Subtype /Image /Width 100 /Height 100 >> endobj
    % ... more shared resources ...
    10 0 obj << /Type /Page /Parent 2 0 R /Resources << /Font << /F1 3 0 R >> >> >> endobj
    11 0 obj << /Type /Page /Parent 2 0 R /Resources << /XObject << /Im1 4 0 R >> >> >> endobj
    12 0 obj << /Type /Page /Parent 2 0 R /Resources << /Font << /F1 3 0 R >> >> >> endobj

3. Incremental document assembly

    % Document created by combining existing PDFs
    1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj
    2 0 obj << /Type /Pages /Kids [100 0 R 25 0 R 75 0 R] /Count 3 >> endobj
    % Objects from first source document
    25 0 obj << /Type /Page /Contents 26 0 R /Parent 2 0 R >> endobj
    % Objects from second source document
    75 0 obj << /Type /Page /Contents 76 0 R /Parent 2 0 R >> endobj
    % Objects from third source document
    100 0 obj << /Type /Page /Contents 101 0 R /Parent 2 0 R >> endobj
Common Development Mistakes and Their Consequences
The complexity of PDF structure leads to a handful of recurring mistakes that can seriously affect application reliability and user experience.
Mistake 1: Assuming Object ID Order Equals Display Order
This is probably the most common mistake made by developers who are new to PDF processing:

    // WRONG: Processing pages by object ID order
    function GetPagesInWrongOrder(Doc: TPDFDocument): TPageList;
    var
      i: Integer;
      Obj: TPDFObject;
    begin
      Result := TPageList.Create;

      // This approach processes pages in storage order, not display order
      for i := 0 to Doc.Objects.Count - 1 do
      begin
        Obj := Doc.Objects[i];
        if (Obj <> nil) and (Obj.GetValue('/Type') = '/Page') then
        begin
          Result.Add(Obj); // Wrong order!
        end;
      end;

      // Result ends up in ascending object ID order, for example [4, 10, 20],
      // even when /Kids specifies the display order [20, 10, 4]
    end;

The consequences of this mistake include:
- Incorrect page order in output documents.
- Inconsistent page numbering.
- Confused users and support requests.
- Potential data corruption in document-processing pipelines.
Mistake 2: Hard-Coded Page Mappings Based on Observation
When developers run into page-ordering problems, they sometimes implement hard-coded fixes based on the patterns they happened to observe.

    // WRONG: Hard-coded page reordering based on heuristics
    function ApplyPageReorderingHeuristics(Pages: TPageArray): TPageArray;
    var
      i: Integer;
    begin
      SetLength(Result, Length(Pages));

      // Dangerous heuristic based on limited observations
      if Length(Pages) = 3 then
      begin
        // "Fix" for specific 3-page documents observed during testing
        Result[0] := Pages[1]; // Put second page first
        Result[1] := Pages[2]; // Put third page second
        Result[2] := Pages[0]; // Put first page last
      end
      else if Length(Pages) > 3 then
      begin
        // Generic "fix" that swaps first and last pages
        Result[0] := Pages[Length(Pages) - 1];
        Result[Length(Pages) - 1] := Pages[0];

        // Keep middle pages in original order
        for i := 1 to Length(Pages) - 2 do
          Result[i] := Pages[i];
      end
      else
      begin
        // For other cases, just copy as-is
        for i := 0 to High(Pages) do
          Result[i] := Pages[i];
      end;
    end;

This approach is fundamentally flawed because:
- It only works for the specific PDF files observed during development.
- It fails outright on PDF files with a different structure.
- It produces unpredictable behavior that users cannot make sense of.
- It accumulates technical debt as more special cases are added.
Mistake 3: Ignoring the Hierarchical Page Tree
Many developers assume that the PDF page tree is always a flat array, but the PDF specification allows a hierarchical structure:

    // WRONG: Assuming flat page tree structure
    function GetPagesFromFlatTree(PagesObj: TPDFObject): TPageArray;
    var
      KidsArray: TPDFArray;
      i: Integer;
    begin
      Result := nil;
      KidsArray := PagesObj.GetArray('/Kids');
      if KidsArray = nil then Exit;

      SetLength(Result, KidsArray.Count);
      for i := 0 to KidsArray.Count - 1 do
      begin
        // This assumes all Kids entries are Page objects
        // But they might be intermediate Pages objects!
        Result[i] := KidsArray.GetIndirectObject(i);
      end;
    end;
The Correct Approach: Follow the Page Tree Structure
The correct way to handle PDF page order is to implement a full page tree traversal that follows the PDF specification.
Understanding the Page Tree Hierarchy
A PDF page tree can be hierarchical: it may contain intermediate Pages objects that have Kids arrays of their own.

    % Hierarchical page tree example
    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    % Root Pages object
    2 0 obj
    <<
      /Type /Pages
      /Kids [3 0 R 8 0 R 15 0 R]
      /Count 7
    >>
    endobj

    % First intermediate Pages object (contains 3 pages)
    3 0 obj
    <<
      /Type /Pages
      /Kids [4 0 R 5 0 R 6 0 R]
      /Count 3
      /Parent 2 0 R
    >>
    endobj

    % Second intermediate Pages object (contains 2 pages)
    8 0 obj
    <<
      /Type /Pages
      /Kids [9 0 R 10 0 R]
      /Count 2
      /Parent 2 0 R
    >>
    endobj

    % Third intermediate Pages object (contains 2 pages)
    15 0 obj
    <<
      /Type /Pages
      /Kids [16 0 R 17 0 R]
      /Count 2
      /Parent 2 0 R
    >>
    endobj

    % Actual page objects
    4 0 obj << /Type /Page /Contents 40 0 R /Parent 3 0 R >> endobj
    5 0 obj << /Type /Page /Contents 41 0 R /Parent 3 0 R >> endobj
    % ... and so on
Implementing Recursive Page Tree Traversal

    // CORRECT: Recursive page tree traversal.
    // TraversePagesTree is declared first so that GetPagesInCorrectOrder can call it.
    procedure TraversePagesTree(PagesObj: TPDFObject; PageList: TList);
    var
      KidsArray: TPDFArray;
      i: Integer;
      ChildObj: TPDFObject;
      ChildType: string;
    begin
      if PagesObj = nil then Exit;

      // Get the Kids array from this Pages object
      KidsArray := PagesObj.GetArray('/Kids');
      if KidsArray = nil then Exit;

      // Process each child in the Kids array, in array order
      for i := 0 to KidsArray.Count - 1 do
      begin
        ChildObj := KidsArray.GetIndirectObject(i);
        if ChildObj = nil then Continue;

        ChildType := ChildObj.GetValue('/Type');

        if ChildType = '/Page' then
        begin
          // This is a leaf page object - add it to our list
          PageList.Add(ChildObj);
        end
        else if ChildType = '/Pages' then
        begin
          // This is an intermediate Pages object - recurse into it
          TraversePagesTree(ChildObj, PageList);
        end
        else
        begin
          // Unexpected object type in Kids array
          raise Exception.CreateFmt('Unexpected object type in Kids array: %s', [ChildType]);
        end;
      end;
    end;

    function GetPagesInCorrectOrder(Doc: TPDFDocument): TPageArray;
    var
      CatalogObj, RootPagesObj: TPDFObject;
      PageList: TList;
      i: Integer;
    begin
      PageList := TList.Create;
      try
        // Step 1: Find the document catalog.
        // (In a PDF file the catalog is the object named by the trailer's /Root entry.)
        CatalogObj := Doc.FindObject('/Type', '/Catalog');
        if CatalogObj = nil then
          raise Exception.Create('Document catalog not found');

        // Step 2: Get the root Pages object
        RootPagesObj := CatalogObj.GetIndirectObject('/Pages');
        if RootPagesObj = nil then
          raise Exception.Create('Root Pages object not found');

        // Step 3: Recursively traverse the page tree
        TraversePagesTree(RootPagesObj, PageList);

        // Step 4: Convert list to array
        SetLength(Result, PageList.Count);
        for i := 0 to PageList.Count - 1 do
          Result[i] := TPDFObject(PageList[i]);
      finally
        PageList.Free;
      end;
    end;
Handling Real-World PDF Variations and Edge Cases
Real-world PDF files often deviate from the ideal structure described in the specification. A robust PDF processing library has to handle these variations gracefully.
Common Structural Anomalies
1. Missing or corrupted catalog

    %PDF-1.4
    % PDF with missing catalog reference
    % Object 1 should be catalog but is missing or corrupted
    2 0 obj
    << /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 >>
    endobj

2. Circular references

    % PDF with circular page tree references (corrupted)
    2 0 obj
    << /Type /Pages /Kids [3 0 R] /Count 1 /Parent 3 0 R >>
    endobj

    3 0 obj
    << /Type /Pages /Kids [2 0 R] /Count 1 /Parent 2 0 R >>
    endobj
3. Inconsistent Count values

    % PDF with incorrect Count value
    2 0 obj
    << /Type /Pages /Kids [3 0 R 4 0 R 5 0 R] /Count 5 >>
    % Count says 5 but Kids array has only 3 elements
    endobj
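One way to cope with an unreliable /Count is to count the leaf pages that are actually reachable and to trust that number instead. The following Free Pascal sketch does this on a small in-memory model. The TNode record and the AddNode, FindNode and CountLeaves helpers are hypothetical stand-ins for whatever object model a real parser provides, and the depth limit is only a crude guard against circular references.

```pascal
program CountCheckDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical in-memory stand-in for a parsed page-tree node.
  TNode = record
    ObjNum: Integer;
    IsPage: Boolean;        // True for /Type /Page, False for /Type /Pages
    DeclaredCount: Integer; // value of /Count (unused for leaf pages)
    Kids: array of Integer; // object numbers listed in /Kids
  end;

var
  Nodes: array of TNode;

procedure AddNode(ObjNum: Integer; IsPage: Boolean; DeclaredCount: Integer;
  const Kids: array of Integer);
var
  n, i: Integer;
begin
  n := Length(Nodes);
  SetLength(Nodes, n + 1);
  Nodes[n].ObjNum := ObjNum;
  Nodes[n].IsPage := IsPage;
  Nodes[n].DeclaredCount := DeclaredCount;
  SetLength(Nodes[n].Kids, Length(Kids));
  for i := 0 to High(Kids) do
    Nodes[n].Kids[i] := Kids[i];
end;

// Returns the index of the node with the given object number, or -1.
function FindNode(ObjNum: Integer): Integer;
var
  i: Integer;
begin
  Result := -1;
  for i := 0 to High(Nodes) do
    if Nodes[i].ObjNum = ObjNum then
    begin
      Result := i;
      Exit;
    end;
end;

// Counts the leaf pages reachable below ObjNum.
// MaxDepth stops the recursion if the tree is circular.
function CountLeaves(ObjNum, MaxDepth: Integer): Integer;
var
  Idx, i: Integer;
begin
  Result := 0;
  Idx := FindNode(ObjNum);
  if (Idx < 0) or (MaxDepth <= 0) then
    Exit;
  if Nodes[Idx].IsPage then
  begin
    Result := 1;
    Exit;
  end;
  for i := 0 to High(Nodes[Idx].Kids) do
    Inc(Result, CountLeaves(Nodes[Idx].Kids[i], MaxDepth - 1));
end;

var
  Declared, Actual: Integer;
begin
  // Mirrors the anomaly above: /Kids has 3 entries but /Count claims 5.
  AddNode(2, False, 5, [3, 4, 5]);
  AddNode(3, True, 0, []);
  AddNode(4, True, 0, []);
  AddNode(5, True, 0, []);

  Declared := Nodes[FindNode(2)].DeclaredCount;
  Actual := CountLeaves(2, 64);
  if Actual <> Declared then
    WriteLn(Format('Warning: /Count is %d but %d pages were found',
      [Declared, Actual]))
  else
    WriteLn(Format('/Count matches: %d pages', [Actual]));
end.
```

Whether to warn, repair, or reject the document on a mismatch is a policy decision; the sketch only reports it.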
Implementing Robust Error Handling

    // Robust page tree traversal with comprehensive error handling.
    // LogMessage, ValidateKidsArray, GetPagesViaStandardTraversal and
    // GetPagesViaBruteForce are assumed to be defined elsewhere.
    function GetPagesViaObjectSearch(Doc: TPDFDocument): TPageArray;
    var
      i, j: Integer;
      Obj: TPDFObject;
      KidsArray: TPDFArray;
      PageList: TList;
      CandidateObjects: TList;
    begin
      CandidateObjects := TList.Create;
      PageList := TList.Create;
      try
        // Find all objects that could be Pages objects
        for i := 0 to Doc.Objects.Count - 1 do
        begin
          Obj := Doc.Objects[i];
          if (Obj <> nil) and
             (Obj.GetValue('/Type') = '/Pages') and
             Obj.HasKey('/Kids') then
          begin
            // Try likely roots (no /Parent) before intermediate nodes,
            // so that a subtree is not mistaken for the whole document
            if Obj.HasKey('/Parent') then
              CandidateObjects.Add(Obj)
            else
              CandidateObjects.Insert(0, Obj);
          end;
        end;

        // Try each candidate Pages object
        for i := 0 to CandidateObjects.Count - 1 do
        begin
          Obj := TPDFObject(CandidateObjects[i]);
          KidsArray := Obj.GetArray('/Kids');

          if (KidsArray <> nil) and (KidsArray.Count > 0) then
          begin
            // Validate that this Kids array contains actual pages
            if ValidateKidsArray(KidsArray) then
            begin
              PageList.Clear;
              TraversePagesTree(Obj, PageList);

              if PageList.Count > 0 then
              begin
                // Found valid pages - convert to array and return
                SetLength(Result, PageList.Count);
                for j := 0 to PageList.Count - 1 do
                  Result[j] := TPDFObject(PageList[j]);
                Exit;
              end;
            end;
          end;
        end;

        // No valid Pages object found
        SetLength(Result, 0);
      finally
        CandidateObjects.Free;
        PageList.Free;
      end;
    end;

    function GetPagesWithFallbacks(Doc: TPDFDocument): TPageArray;
    var
      AttemptCount: Integer;
      ErrorMessages: TStringList;
    begin
      ErrorMessages := TStringList.Create;
      try
        AttemptCount := 0;

        // Attempt 1: Standard PDF specification approach
        Inc(AttemptCount);
        try
          Result := GetPagesViaStandardTraversal(Doc);
          if Length(Result) > 0 then
          begin
            LogMessage(Format('Success with standard traversal (attempt %d)', [AttemptCount]));
            Exit;
          end;
        except
          on E: Exception do
            ErrorMessages.Add(Format('Attempt %d failed: %s', [AttemptCount, E.Message]));
        end;

        // Attempt 2: Search for Pages objects and try each one
        Inc(AttemptCount);
        try
          Result := GetPagesViaObjectSearch(Doc);
          if Length(Result) > 0 then
          begin
            LogMessage(Format('Success with object search (attempt %d)', [AttemptCount]));
            Exit;
          end;
        except
          on E: Exception do
            ErrorMessages.Add(Format('Attempt %d failed: %s', [AttemptCount, E.Message]));
        end;

        // Attempt 3: Brute force search for Page objects
        Inc(AttemptCount);
        try
          Result := GetPagesViaBruteForce(Doc);
          if Length(Result) > 0 then
          begin
            LogMessage(Format('Success with brute force search (attempt %d)', [AttemptCount]));
            LogMessage('Warning: Document structure is non-standard');
            Exit;
          end;
        except
          on E: Exception do
            ErrorMessages.Add(Format('Attempt %d failed: %s', [AttemptCount, E.Message]));
        end;

        // All attempts failed
        raise Exception.Create('Failed to extract pages from PDF. Errors: ' + ErrorMessages.Text);
      finally
        ErrorMessages.Free;
      end;
    end;
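The fallback chain still depends on TraversePagesTree, which would recurse without end on the circular references shown earlier. A common remedy is to remember every object already entered and to refuse to enter it twice. The sketch below illustrates the idea in Free Pascal on an in-memory model; TNode, AddNode, FindNode, ContainsInt, AppendInt and Walk are hypothetical names chosen for this example, not the API used in the listings above.

```pascal
program CycleSafeTraversal;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  TIntArray = array of Integer;

  // Hypothetical in-memory stand-in for a parsed page-tree node.
  TNode = record
    ObjNum: Integer;
    IsPage: Boolean;  // True for /Type /Page, False for /Type /Pages
    Kids: TIntArray;  // object numbers listed in /Kids
  end;

var
  Nodes: array of TNode;

procedure AddNode(ObjNum: Integer; IsPage: Boolean; const Kids: array of Integer);
var
  n, i: Integer;
begin
  n := Length(Nodes);
  SetLength(Nodes, n + 1);
  Nodes[n].ObjNum := ObjNum;
  Nodes[n].IsPage := IsPage;
  SetLength(Nodes[n].Kids, Length(Kids));
  for i := 0 to High(Kids) do
    Nodes[n].Kids[i] := Kids[i];
end;

// Returns the index of the node with the given object number, or -1.
function FindNode(ObjNum: Integer): Integer;
var
  i: Integer;
begin
  Result := -1;
  for i := 0 to High(Nodes) do
    if Nodes[i].ObjNum = ObjNum then
    begin
      Result := i;
      Exit;
    end;
end;

function ContainsInt(const A: TIntArray; Value: Integer): Boolean;
var
  i: Integer;
begin
  Result := False;
  for i := 0 to High(A) do
    if A[i] = Value then
    begin
      Result := True;
      Exit;
    end;
end;

procedure AppendInt(var A: TIntArray; Value: Integer);
begin
  SetLength(A, Length(A) + 1);
  A[High(A)] := Value;
end;

// Depth-first walk in /Kids order that refuses to enter any object twice.
procedure Walk(ObjNum: Integer; var Visited, Pages: TIntArray);
var
  Idx, i: Integer;
begin
  if ContainsInt(Visited, ObjNum) then
    raise Exception.CreateFmt(
      'Object %d reached twice (circular or shared page tree node)', [ObjNum]);
  AppendInt(Visited, ObjNum);

  Idx := FindNode(ObjNum);
  if Idx < 0 then
    Exit; // dangling reference: skip it

  if Nodes[Idx].IsPage then
    AppendInt(Pages, ObjNum)
  else
    for i := 0 to High(Nodes[Idx].Kids) do
      Walk(Nodes[Idx].Kids[i], Visited, Pages);
end;

var
  Visited, Pages: TIntArray;
begin
  // Corrupted tree: objects 2 and 3 list each other in /Kids.
  AddNode(2, False, [3]);
  AddNode(3, False, [2]);
  try
    Walk(2, Visited, Pages);
    WriteLn('Pages found: ', Length(Pages));
  except
    on E: Exception do
      WriteLn('Rejected: ', E.Message);
  end;
end.
```

The linear search in ContainsInt keeps the sketch short; for large documents a hash set or a bit array indexed by object number would be a more sensible choice.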
Performance Optimization Strategies
When processing large PDF files or running high-throughput document pipelines, performance becomes a key consideration.
Lazy Loading and Caching

    // Performance-optimized page access with caching
    type
      TPDFPageCache = class
      private
        FPages: array of TPDFPage;
        FPageObjects: array of TPDFObject;
        FCacheHits: Integer;
        FCacheMisses: Integer;
        FMaxCacheSize: Integer;
      public
        constructor Create(MaxCacheSize: Integer = 100);
        destructor Destroy; override;
        function GetPage(Index: Integer): TPDFPage;
        procedure ClearCache;
        procedure GetCacheStatistics(out Hits, Misses: Integer);
      end;

    function TPDFPageCache.GetPage(Index: Integer): TPDFPage;
    begin
      // Check if page is already cached
      if (Index >= 0) and (Index < Length(FPages)) and (FPages[Index] <> nil) then
      begin
        Inc(FCacheHits);
        Result := FPages[Index];
        Exit;
      end;

      Inc(FCacheMisses);

      // Load page from object if not cached
      if (Index >= 0) and (Index < Length(FPageObjects)) and (FPageObjects[Index] <> nil) then
      begin
        Result := TPDFPage.CreateFromObject(FPageObjects[Index]);

        // Cache the page only if its slot lies within the size limit,
        // so the cache array never grows beyond FMaxCacheSize entries.
        // Pages outside the limit are not cached and belong to the caller.
        if Index < FMaxCacheSize then
        begin
          if Index >= Length(FPages) then
            SetLength(FPages, Index + 1);
          FPages[Index] := Result;
        end;
      end
      else
      begin
        Result := nil;
      end;
    end;
Chunked Processing for Large Documents

    // Chunked approach for processing large PDF documents.
    // Min comes from the Math unit. ForceGarbageCollection is an
    // application-defined hook, assumed here to release cached per-page data.
    procedure ProcessLargePDFInChunks(const FileName: string; ChunkSize: Integer = 50);
    var
      Doc: TPDFDocument;
      TotalPages: Integer;
      ChunkStart, ChunkEnd: Integer;
      i: Integer;
    begin
      // A non-positive chunk size would make the loop below run forever
      if ChunkSize < 1 then
        raise Exception.Create('ChunkSize must be at least 1');

      Doc := TPDFDocument.Create;
      try
        Doc.LoadFromFile(FileName);
        TotalPages := Doc.GetPageCount;

        LogMessage(Format('Processing %d pages in chunks of %d', [TotalPages, ChunkSize]));

        ChunkStart := 0;
        while ChunkStart < TotalPages do
        begin
          ChunkEnd := Min(ChunkStart + ChunkSize - 1, TotalPages - 1);

          LogMessage(Format('Processing chunk: pages %d-%d', [ChunkStart + 1, ChunkEnd + 1]));

          // Process this chunk of pages
          for i := ChunkStart to ChunkEnd do
          begin
            ProcessSinglePage(Doc, i);
          end;

          // Optional: release memory every fourth chunk
          if (ChunkStart mod (ChunkSize * 4)) = 0 then
          begin
            ForceGarbageCollection;
          end;

          ChunkStart := ChunkEnd + 1;
        end;
      finally
        Doc.Free;
      end;
    end;
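Another way to avoid touching every page object in a large document is to use the /Count entries of intermediate Pages nodes to skip whole subtrees when looking up a single page. The Free Pascal sketch below models the hierarchical tree from the earlier example in memory. TNode, AddNode, FindNode and FindPageByIndex are hypothetical names, and the approach assumes that the /Count values have already been validated and that the tree contains no cycles.

```pascal
program PageLookupByCount;
{$mode objfpc}{$H+}

uses
  SysUtils;

type
  // Hypothetical in-memory stand-in for a parsed page-tree node.
  TNode = record
    ObjNum: Integer;
    IsPage: Boolean;        // True for /Type /Page, False for /Type /Pages
    DeclaredCount: Integer; // value of /Count (unused for leaf pages)
    Kids: array of Integer; // object numbers listed in /Kids
  end;

var
  Nodes: array of TNode;

procedure AddNode(ObjNum: Integer; IsPage: Boolean; DeclaredCount: Integer;
  const Kids: array of Integer);
var
  n, i: Integer;
begin
  n := Length(Nodes);
  SetLength(Nodes, n + 1);
  Nodes[n].ObjNum := ObjNum;
  Nodes[n].IsPage := IsPage;
  Nodes[n].DeclaredCount := DeclaredCount;
  SetLength(Nodes[n].Kids, Length(Kids));
  for i := 0 to High(Kids) do
    Nodes[n].Kids[i] := Kids[i];
end;

// Returns the index of the node with the given object number, or -1.
function FindNode(ObjNum: Integer): Integer;
var
  i: Integer;
begin
  Result := -1;
  for i := 0 to High(Nodes) do
    if Nodes[i].ObjNum = ObjNum then
    begin
      Result := i;
      Exit;
    end;
end;

// Returns the object number of the page at zero-based Index within the
// subtree rooted at ObjNum, or -1 if the index is out of range.
function FindPageByIndex(ObjNum, Index: Integer): Integer;
var
  Idx, i, KidIdx, KidCount: Integer;
begin
  Result := -1;
  Idx := FindNode(ObjNum);
  if (Idx < 0) or (Index < 0) then
    Exit;
  for i := 0 to High(Nodes[Idx].Kids) do
  begin
    KidIdx := FindNode(Nodes[Idx].Kids[i]);
    if KidIdx < 0 then
      Continue;
    // A leaf counts as one page; a subtree counts as its /Count.
    if Nodes[KidIdx].IsPage then
      KidCount := 1
    else
      KidCount := Nodes[KidIdx].DeclaredCount;
    if Index < KidCount then
    begin
      if Nodes[KidIdx].IsPage then
        Result := Nodes[KidIdx].ObjNum
      else
        Result := FindPageByIndex(Nodes[KidIdx].ObjNum, Index);
      Exit;
    end;
    // The wanted page is further right: skip this kid entirely.
    Dec(Index, KidCount);
  end;
end;

var
  p: Integer;
begin
  // Root with three intermediate nodes holding 3, 2 and 2 pages.
  AddNode(2, False, 7, [3, 8, 15]);
  AddNode(3, False, 3, [4, 5, 6]);
  AddNode(8, False, 2, [9, 10]);
  AddNode(15, False, 2, [16, 17]);
  for p := 4 to 6 do AddNode(p, True, 0, []);
  for p := 9 to 10 do AddNode(p, True, 0, []);
  for p := 16 to 17 do AddNode(p, True, 0, []);

  for p := 0 to 6 do
    WriteLn(Format('page %d -> object %d', [p + 1, FindPageByIndex(2, p)]));
end.
```

The lookup descends only into the one subtree that contains the requested page, so its cost depends on the depth and fan-out of the tree rather than on the total page count.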
Advanced PDF Structure Analysis
For developers who have to handle complex PDF requirements, understanding the more advanced structural elements is essential.
Page Inheritance and Resource Management
PDF pages can inherit attributes from their parent Pages objects, which results in a hierarchical resource management system.

    % Example of page inheritance in PDF structure
    2 0 obj
    <<
      /Type /Pages
      /Kids [3 0 R 4 0 R]
      /Count 2
      /MediaBox [0 0 612 792]
      /Resources <<
        /Font << /F1 10 0 R >>
        /ProcSet [/PDF /Text]
      >>
    >>
    endobj

    % Child page inherits MediaBox and Resources from parent
    3 0 obj
    <<
      /Type /Page
      /Parent 2 0 R
      /Contents 5 0 R
    >>
    % This page inherits MediaBox [0 0 612 792] and Resources from parent
    endobj

    % Child page overrides inherited MediaBox
    4 0 obj
    <<
      /Type /Page
      /Parent 2 0 R
      /Contents 6 0 R
      /MediaBox [0 0 792 612]
    >>
    % This page overrides MediaBox but still inherits Resources
    endobj
Handling Page Inheritance in Code

    // Proper handling of page inheritance
    function GetEffectivePageProperties(PageObj: TPDFObject): TPDFPageProperties;
    const
      MaxDepth = 64; // Guards against circular /Parent chains in corrupted PDFs
    var
      CurrentObj: TPDFObject;
      MediaBox: TPDFArray;
      Resources: TPDFObject;
      Depth: Integer;
    begin
      // Initialize result
      Result := TPDFPageProperties.Create;
      try
        // Walk up the parent chain to collect inherited properties.
        // The nearest definition wins, so a value is only taken if not yet set.
        CurrentObj := PageObj;
        Depth := 0;
        while (CurrentObj <> nil) and (Depth < MaxDepth) do
        begin
          // Check for MediaBox at this level
          if Result.MediaBox.IsEmpty then
          begin
            MediaBox := CurrentObj.GetArray('/MediaBox');
            if MediaBox <> nil then
              Result.MediaBox := MediaBox;
          end;

          // Check for Resources at this level
          if Result.Resources = nil then
          begin
            Resources := CurrentObj.GetDictionary('/Resources');
            if Resources <> nil then
              Result.Resources := Resources;
          end;

          // Check for other inheritable properties (the helper is assumed
          // to leave a value untouched once it has been set)
          CheckForInheritableProperty(CurrentObj, '/Rotate', Result.Rotate);
          CheckForInheritableProperty(CurrentObj, '/CropBox', Result.CropBox);

          // Move to parent object
          CurrentObj := CurrentObj.GetIndirectObject('/Parent');
          Inc(Depth);
        end;

        // Validate that we found required properties
        if Result.MediaBox.IsEmpty then
          raise Exception.Create('No MediaBox found in page inheritance chain');
      except
        // Do not leak the result object when validation fails
        Result.Free;
        raise;
      end;
    end;
Testing Strategies for PDF Page Ordering
Because document structure can vary in so many ways, thorough testing is essential when dealing with PDF page ordering.
Building a Comprehensive Test Suite

    # Comprehensive PDF test case generation script

    # Test Case 1: Sequential pages (baseline)
    echo "Creating sequential page test..."
    pdftk A=template.pdf cat A A A output test-sequential.pdf

    # Test Case 2: Non-sequential object IDs
    echo "Creating non-sequential object ID test..."
    pdftk A=page3.pdf B=page1.pdf C=page2.pdf cat A B C output test-nonsequential.pdf

    # Test Case 3: Hierarchical page tree
    echo "Creating hierarchical page tree test..."
    # This requires custom PDF generation tool (placeholder command)
    generate-hierarchical-pdf --depth 3 --pages-per-node 2 output test-hierarchical.pdf

    # Test Case 4: Large document with mixed structures
    echo "Creating large document test..."
    pdftk A=large-doc.pdf cat A1-100 A50-149 A200-299 output test-large-mixed.pdf

    # Test Case 5: Corrupted page tree
    echo "Creating corrupted page tree test..."
    # This requires custom corruption tool (placeholder command)
    corrupt-pdf-structure --target pages-tree test-sequential.pdf test-corrupted.pdf

    # Test Case 6: Minimal single-page document
    echo "Creating minimal single-page test..."
    pdftk A=template.pdf cat A1 output test-single-page.pdf
Automated Validation Framework

    // Comprehensive PDF page ordering validation framework.
    // ValidatePageAtPosition is assumed to be defined elsewhere.
    type
      TPDFTestCase = record
        FileName: string;
        ExpectedPageCount: Integer;
        ExpectedPageOrder: array of Integer;
        Description: string;
      end;

    function ValidateTestCase(const TestCase: TPDFTestCase): Boolean;
    var
      Doc: TPDFDocument;
      ActualPages: TPageArray;
      i: Integer;
    begin
      Result := False;
      Doc := TPDFDocument.Create;
      try
        if not Doc.LoadFromFile(TestCase.FileName) then
        begin
          WriteLn(Format('Failed to load %s', [TestCase.FileName]));
          Exit;
        end;

        ActualPages := GetPagesInCorrectOrder(Doc);

        // Validate page count
        if Length(ActualPages) <> TestCase.ExpectedPageCount then
        begin
          WriteLn(Format('Page count mismatch: expected %d, got %d',
            [TestCase.ExpectedPageCount, Length(ActualPages)]));
          Exit;
        end;

        // Validate page order (simplified - in real implementation,
        // you'd compare actual page content or identifiers)
        for i := 0 to High(ActualPages) do
        begin
          if not ValidatePageAtPosition(ActualPages[i], TestCase.ExpectedPageOrder[i]) then
          begin
            WriteLn(Format('Page order mismatch at position %d', [i]));
            Exit;
          end;
        end;

        Result := True;
      finally
        Doc.Free;
      end;
    end;

    function RunPDFPageOrderingTests: Boolean;
    var
      TestCases: array of TPDFTestCase;
      i: Integer;
      PassCount, FailCount: Integer;
    begin
      // Define test cases
      SetLength(TestCases, 6);

      TestCases[0].FileName := 'test-sequential.pdf';
      TestCases[0].ExpectedPageCount := 3;
      TestCases[0].ExpectedPageOrder := [0, 1, 2];
      TestCases[0].Description := 'Sequential page ordering';

      TestCases[1].FileName := 'test-nonsequential.pdf';
      TestCases[1].ExpectedPageCount := 3;
      TestCases[1].ExpectedPageOrder := [2, 0, 1]; // Based on how pdftk reorders
      TestCases[1].Description := 'Non-sequential object IDs';

      // ... define other test cases ...

      PassCount := 0;
      FailCount := 0;

      WriteLn('Running PDF page ordering tests...');
      WriteLn(StringOfChar('=', 50));

      for i := 0 to High(TestCases) do
      begin
        Write(Format('Test %d: %s... ', [i + 1, TestCases[i].Description]));

        if ValidateTestCase(TestCases[i]) then
        begin
          WriteLn('PASS');
          Inc(PassCount);
        end
        else
        begin
          WriteLn('FAIL');
          Inc(FailCount);
        end;
      end;

      WriteLn(StringOfChar('=', 50));
      WriteLn(Format('Results: %d passed, %d failed', [PassCount, FailCount]));

      Result := FailCount = 0;
    end;
Future-Proofing PDF Processing Code
As the PDF standard evolves and new use cases appear, it is important to write code that can adapt to future requirements.
Designing for Extensibility

    // Extensible PDF page processing architecture
    type
      IPDFPageProcessor = interface
        ['{12345678-1234-1234-1234-123456789012}'] // Placeholder GUID
        function ProcessPage(Page: TPDFPage; Context: TPDFProcessingContext): Boolean;
        function GetProcessorName: string;
        function GetSupportedPDFVersions: TStringArray;
      end;

      TPDFProcessingPipeline = class
      private
        // TInterfaceList (Classes unit) keeps interface references alive;
        // a plain TList of pointers would not
        FProcessors: TInterfaceList;
        FContext: TPDFProcessingContext;
      public
        constructor Create;
        destructor Destroy; override;
        procedure RegisterProcessor(Processor: IPDFPageProcessor);
        procedure UnregisterProcessor(Processor: IPDFPageProcessor);
        function ProcessDocument(Doc: TPDFDocument): Boolean;
      end;

    function TPDFProcessingPipeline.ProcessDocument(Doc: TPDFDocument): Boolean;
    var
      Pages: TPageArray;
      i, j: Integer;
      Page: TPDFPage;
      Processor: IPDFPageProcessor;
      Success: Boolean;
    begin
      Result := True;

      // Get pages in correct order using our robust method
      Pages := GetPagesInCorrectOrder(Doc);

      // Process each page through all registered processors
      for i := 0 to High(Pages) do
      begin
        Page := TPDFPage.CreateFromObject(Pages[i]);
        try
          FContext.CurrentPageIndex := i;
          FContext.TotalPages := Length(Pages);

          for j := 0 to FProcessors.Count - 1 do
          begin
            Processor := FProcessors[j] as IPDFPageProcessor;
            Success := Processor.ProcessPage(Page, FContext);

            if not Success then
            begin
              LogError(Format('Processor %s failed on page %d',
                [Processor.GetProcessorName, i + 1]));
              Result := False;
              // Continue with other processors/pages or break based on policy
            end;
          end;
        finally
          Page.Free;
        end;
      end;
    end;
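Investing the effort to understand PDF structure properly reduces the support burden, improves user satisfaction, and simplifies maintenance over the lifetime of an application. PDF page ordering is not just a technical detail; it is a fundamental aspect of document integrity that directly affects the user experience. Master this complexity and you will build PDF applications that users can trust with their important documents.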