Understanding the PDF Page Tree: Why Page Order Matters

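PDF documents may look simple on the surface, but their internal structure can be surprisingly complex. One area where developers often struggle is understanding how PDF page order actually works. We ran into exactly this problem while fixing and improving our PDF page-copy sample program, which is built on the HotPDF Delphi PDF Component.

PDF Document Architecture

Core Concepts

At its core, a PDF document is like an object database. Each object has a unique identifier and can reference other objects. The result is a complex data structure in which the document catalog (the root) serves as the entry point to the various parts of the document.

Think of a PDF as an iceberg. What you see when viewing the document is only the surface; beneath it lies an elaborate structure of objects, references, and metadata that defines how every aspect of the document looks and behaves.

The Object Reference System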

1 0 obj <- Object 1

/Type /Page

/Parent 3 0 R

/Contents 4 0 R

/MediaBox [0 0 612 792]

/Resources 5 0 R

endobj

Every PDF indirect object follows the pattern ObjectNumber Generation obj. In a reference such as 3 0 R, the trailing R marks an indirect reference, so 3 0 R means "a reference to object 3, generation 0".
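The reference syntax described above is simple enough to parse with a regular expression. The following is a minimal sketch, not code from HotPDF: the names Ref and parse_ref are hypothetical, and it only handles a string that consists of a single reference.

```python
import re
from typing import NamedTuple


class Ref(NamedTuple):
    """An indirect reference: object number plus generation number."""
    obj_num: int
    gen: int


# Two non-negative integers followed by the keyword R, e.g. "3 0 R".
_REF_PATTERN = re.compile(r"(\d+)\s+(\d+)\s+R")


def parse_ref(text: str) -> Ref:
    """Parse a string holding exactly one indirect reference."""
    match = _REF_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an indirect reference: {text!r}")
    return Ref(int(match.group(1)), int(match.group(2)))


print(parse_ref("3 0 R"))  # prints Ref(obj_num=3, gen=0)
```

Keeping the generation number alongside the object number matters because the pair, not the object number alone, identifies an object.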

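Understanding Generation Numbers

Generation numbers (usually 0 in modern PDFs) serve a specific purpose:

Generation 0: an object whose number has never been reused (the common case)
Generation 1+: the object number was freed and later reused; each reuse increments the generation. An incremental update that merely modifies an object keeps its generation number.
Generation 65535: marks a free cross-reference entry that must never be reused (always the case for object 0)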

% Original object

5 0 obj

<< /Type /Page /Contents 6 0 R >>

endobj

% Object number 5 reused after the original was freed (generation incremented)

5 1 obj

<< /Type /Page /Contents 6 0 R /Rotate 90 >>

endobj

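PDF File Structure Overview

A PDF file consists of four main parts:

Header: version information (%PDF-1.7)
Body: object definitions and data
Cross-reference table: index of object locations
Trailer: root reference and file metadata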

%PDF-1.7 <- Header

1 0 obj << /Type /Catalog ... >> <- Body (objects)

2 0 obj << /Type /Pages ... >>

...

xref <- Cross-reference table

0 10

0000000000 65535 f

0000000009 00000 n

...

trailer <- Trailer

<< /Size 10 /Root 1 0 R >>

startxref

1234

%%EOF
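A reader typically starts at the end of the file: the number after startxref is the byte offset of the cross-reference table. The sketch below shows one way to locate it, assuming the last startxref/%%EOF pair sits within the final kilobyte of the file. The function name find_startxref and the tail_size parameter are illustrative, not part of any library.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the xref offset announced by the last startxref keyword."""
    tail = data[-tail_size:]
    # An incrementally updated file has several startxref sections;
    # the last one describes the most recent revision.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not offsets:
        raise ValueError("no startxref/%%EOF pair found near end of file")
    return int(offsets[-1])


sample = (
    b"%PDF-1.7\n"
    b"1 0 obj << /Type /Catalog >> endobj\n"
    b"trailer\n<< /Size 10 /Root 1 0 R >>\n"
    b"startxref\n1234\n%%EOF\n"
)
print(find_startxref(sample))  # prints 1234
```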

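The Page Tree Structure

The Page Tree Concept

PDF uses a hierarchical tree structure to organize pages, much as a file system organizes directories. This design serves several purposes:

Efficient navigation: any page can be reached quickly without parsing the entire document.
Attribute inheritance: common attributes can be inherited from parent nodes.
Scalability: documents with thousands of pages are handled efficiently.
Flexibility: complex document structures and nested sections are supported.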

Root Catalog

↓

Pages Tree Root (/Type /Pages)

↓

Kids Array → [Page1, Page2, Page3, ...]

↓ ↓ ↓

/Type /Page /Type /Page /Type /Page

A Practical Example: A Simple Page Tree

Here is what a typical page tree looks like in a PDF file:

16 0 obj (Pages Tree Root)

/Type /Pages

/Count 3

/Kids [

20 0 R <- Reference to first page

1 0 R <- Reference to second page

4 0 R <- Reference to third page

]

/MediaBox [0 0 612 792] <- Inherited by all pages

endobj

20 0 obj (First Page)

/Type /Page

/Parent 16 0 R

/Contents 21 0 R

/Resources 22 0 R

endobj

1 0 obj (Second Page)

/Type /Page

/Parent 16 0 R

/Contents 2 0 R

/Resources 3 0 R

/Rotate 90

endobj

4 0 obj (Third Page)

/Type /Page

/Parent 16 0 R

/Contents 5 0 R

/Resources 6 0 R

endobj

Important: the Kids array defines the logical page order; it does not reflect the physical order of objects in the file.

A Real Example from qpdf Output

Here is the actual output produced by qpdf --show-pages for a problematic PDF file:

page 1: 20 0 R

content: 192 0 R

page 2: 1 0 R

content: 190 0 R

page 3: 4 0 R

content: 188 0 R

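Note that:

Logical page 1 is stored in object 20 (the highest object number)
Logical page 2 is stored in object 1 (the lowest object number)
Logical page 3 is stored in object 4 (the middle object number)

If parsing code processes objects in numeric order (1, 4, 20), it yields the wrong page order (2, 3, 1) instead of the correct logical order (1, 2, 3).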
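The mismatch is easy to reproduce. In this small sketch the Kids array from the example is modeled as a plain list of object numbers (a simplification; real Kids entries are indirect references), and the hypothetical helper pages_in_numeric_order reports which logical pages you get when visiting objects in ascending object-number order.

```python
def pages_in_numeric_order(kids: list[int]) -> list[int]:
    """Logical page numbers encountered when page objects are visited
    in ascending object-number order instead of Kids order."""
    # Position in Kids (1-based) is the logical page number.
    logical_page = {obj_num: index + 1 for index, obj_num in enumerate(kids)}
    return [logical_page[obj_num] for obj_num in sorted(kids)]


kids = [20, 1, 4]  # Kids array from the example: [20 0 R, 1 0 R, 4 0 R]
print(pages_in_numeric_order(kids))  # prints [2, 3, 1]
```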

A Complex Example: Nested Page Trees

Large documents often use nested page trees for better organization:

1 0 obj (Document Catalog)

/Type /Catalog

/Pages 2 0 R

endobj

2 0 obj (Root Pages Node)

/Type /Pages

/Count 8

/Kids [3 0 R 4 0 R] <- Two intermediate nodes

endobj

3 0 obj (Chapter 1 Pages)

/Type /Pages

/Parent 2 0 R

/Count 5

/Kids [10 0 R 11 0 R 12 0 R 13 0 R 14 0 R]

/MediaBox [0 0 612 792]

endobj

4 0 obj (Chapter 2 Pages)

/Type /Pages

/Parent 2 0 R

/Count 3

/Kids [20 0 R 21 0 R 22 0 R]

/MediaBox [0 0 612 792]

endobj

% Individual page objects follow...

10 0 obj << /Type /Page /Parent 3 0 R ... >>

11 0 obj << /Type /Page /Parent 3 0 R ... >>

...

This produces the following tree structure:

Root (8 pages)

├── Chapter 1 (5 pages)

│ ├── Page 1 (10 0 R)

│ ├── Page 2 (11 0 R)

│ ├── Page 3 (12 0 R)

│ ├── Page 4 (13 0 R)

│ └── Page 5 (14 0 R)

└── Chapter 2 (3 pages)

├── Page 6 (20 0 R)

├── Page 7 (21 0 R)

└── Page 8 (22 0 R)
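To turn such a tree into a flat, ordered page list, a depth-first walk that visits Kids in array order is enough. The sketch below models the object table as a Python dict keyed by object number, which is an assumption made for illustration only; flatten_page_tree is a hypothetical name. It also refuses to visit any object twice, which guards against malformed files whose Kids arrays form a loop.

```python
def flatten_page_tree(objects, node_num, seen=None):
    """Return page object numbers in logical order (depth-first, Kids order)."""
    seen = set() if seen is None else seen
    if node_num in seen:
        raise ValueError(f"object {node_num} reached twice in page tree")
    seen.add(node_num)

    node = objects[node_num]
    if node["Type"] == "Page":
        return [node_num]          # leaf: an actual page

    pages = []
    for kid in node["Kids"]:       # intermediate node: recurse in Kids order
        pages.extend(flatten_page_tree(objects, kid, seen))
    return pages


# The nested example above, reduced to the keys the walk needs.
objects = {
    2: {"Type": "Pages", "Kids": [3, 4]},
    3: {"Type": "Pages", "Kids": [10, 11, 12, 13, 14]},
    4: {"Type": "Pages", "Kids": [20, 21, 22]},
}
for page_num in (10, 11, 12, 13, 14, 20, 21, 22):
    objects[page_num] = {"Type": "Page"}

print(flatten_page_tree(objects, 2))
# prints [10, 11, 12, 13, 14, 20, 21, 22]
```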

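Page Tree Attributes

Required attributes:

/Type: must be /Pages for intermediate nodes or /Page for leaf nodes
/Kids: array of references to child nodes (intermediate nodes only)
/Count: total number of descendant leaf pages (intermediate nodes only)
/Parent: reference to the parent node (every node except the root)

Optional attributes (only the first four are inheritable):

/MediaBox: page size (inheritable)
/CropBox: visible region (inheritable)
/Resources: fonts, images, graphics states (inheritable)
/Rotate: page rotation in degrees: 0, 90, 180, or 270 (inheritable)
/BleedBox: bleed region for print production (not inheritable; set on the page itself)
/TrimBox: final trimmed page size (not inheritable; set on the page itself)
/ArtBox: region of meaningful content (not inheritable; set on the page itself)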
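Resolving a page's effective attributes means walking the /Parent chain and taking the nearest definition of each inheritable key. The sketch below assumes the four inheritable keys defined by ISO 32000-1 (Resources, MediaBox, CropBox, Rotate) and models objects as plain dicts keyed by object number; effective_attributes is a hypothetical helper, not a HotPDF function.

```python
INHERITABLE = ("Resources", "MediaBox", "CropBox", "Rotate")


def effective_attributes(objects, page_num):
    """Collect inheritable attributes, nearest ancestor first."""
    result = {}
    visited = set()
    current = page_num
    while current is not None and current not in visited:
        visited.add(current)                 # guard against /Parent loops
        node = objects[current]
        for key in INHERITABLE:
            if key in node and key not in result:
                result[key] = node[key]      # closest definition wins
        current = node.get("Parent")
    result.setdefault("Rotate", 0)           # default when never specified
    return result


objects = {
    1: {"Type": "Pages", "MediaBox": [0, 0, 612, 792], "Resources": "10 0 R"},
    2: {"Type": "Pages", "Parent": 1, "Rotate": 90},
    3: {"Type": "Page", "Parent": 2},
    4: {"Type": "Page", "Parent": 1, "MediaBox": [0, 0, 595, 842]},
}
print(effective_attributes(objects, 3)["Rotate"])    # prints 90
print(effective_attributes(objects, 4)["MediaBox"])  # prints [0, 0, 595, 842]
```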

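Common Misconceptions

Mistake #1: Assuming Sequential Object Numbers = Page Order

Many developers assume that if a PDF file stores its pages as objects 1, 2, and 3, then object 1 is page 1. This is fundamentally wrong and leads to subtle bugs.

Why this assumption fails:

Object numbers are assigned as the PDF is generated, not according to page order.
PDF editors may renumber objects during optimization.
Incremental updates add new objects with higher numbers.
Object streams can change how objects are numbered and stored.

Reality: object numbers are merely identifiers; the actual page order is determined by the Kids arrays of the page tree.

A practical example: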

% These pages were created in order: Page 1, Page 2, Page 3

% But stored in PDF with these object numbers:

150 0 obj << /Type /Page ... >> % Actually page 1

23 0 obj << /Type /Page ... >> % Actually page 2

8 0 obj << /Type /Page ... >> % Actually page 3

% The Pages tree defines the correct order:

16 0 obj

/Type /Pages

/Kids [150 0 R 23 0 R 8 0 R] % Logical order

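Mistake #2: Processing Pages in Physical File Order

Reading objects sequentially from a PDF file does not yield the pages in the correct order.

Example problem:

Objects appear in the file in this physical order: 1, 4, 16, 20
Kids array of the Pages tree: [20 0 R, 1 0 R, 4 0 R]
Correct logical page order: object 20 (page 1), object 1 (page 2), object 4 (page 3)
Wrong physical file order: object 1 (page 2), object 4 (page 3), object 16 (not a page), object 20 (page 1)

Why this happens:

PDF generators optimize for file size, not page order.
Object streams can reorganize content.
Linearization reorders objects for fast web view.
Multiple editing tools may layer their changes on top of one another.

Mistake #3: Ignoring the Document Catalog

Some parsing code tries to find pages directly instead of following the proper chain (Root → Pages → Kids).

Problematic approach: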

// Wrong: Direct page search

for i := 0 to Objects.Count - 1 do

begin

if Objects[i].GetValue('/Type') = '/Page' then

AddToPageList(Objects[i]); // Wrong order!

end;

Correct approach:

// Right: Follow the document structure

CatalogObj := FindObjectByReference(TrailerRoot);

PagesObj := FindObjectByReference(CatalogObj.GetValue('/Pages'));

KidsArray := PagesObj.GetValue('/Kids');

for i := 0 to KidsArray.Count - 1 do

begin

PageRef := KidsArray.GetReference(i);

PageObj := FindObjectByReference(PageRef);

AddToPageList(PageObj); // Correct order!

end;

Mistake #4: Not Handling Nested Page Trees

Assuming that every page tree is flat (single level) means missing pages in more complex document structures.

Simple tree (often assumed):

Pages Root

├── Page 1

├── Page 2

└── Page 3

Actual complex tree:

Pages Root

├── Part 1 Pages

│ ├── Chapter 1 Pages

│ │ ├── Page 1

│ │ └── Page 2

│ └── Chapter 2 Pages

│ ├── Page 3

│ └── Page 4

└── Part 2 Pages

└── Page 5

Handling the recursive structure:

procedure ProcessPageNode(Node: TPDFObject; var PageList: TPageList);

var

KidsArray: TPDFArray;

i: Integer;

ChildRef: TPDFReference;

ChildObj: TPDFObject;

begin

if Node.GetValue('/Type') = '/Pages' then

begin

// Intermediate node - process all kids

KidsArray := Node.GetValue('/Kids');

for i := 0 to KidsArray.Count - 1 do

begin

ChildRef := KidsArray.GetReference(i);

ChildObj := FindObjectByReference(ChildRef);

ProcessPageNode(ChildObj, PageList); // Recursive call

end;

end

else if Node.GetValue('/Type') = '/Page' then

begin

// Leaf node - actual page

PageList.Add(Node);

end;

end;

Mistake #5: Ignoring Page Inheritance

If inherited attributes are not taken into account, pages will not render correctly.

Inheritance chain example:

Root Pages (/MediaBox [0 0 612 792], /Resources 10 0 R)

├── Chapter Pages (/Rotate 90)

│ └── Page 1 (/Contents 20 0 R)

└── Page 2 (/Contents 21 0 R, /MediaBox [0 0 595 842])

Effective attributes:

Page 1: MediaBox=[0 0 612 792] (inherited), Rotate=90 (inherited), Resources=10 0 R (inherited), Contents=20 0 R
Page 2: MediaBox=[0 0 595 842] (overridden), Rotate=0 (not inherited; default value), Resources=10 0 R (inherited), Contents=21 0 R

Implementation (HotPDF component):

function GetEffectivePageProperties(PageObj: TPDFDictionary): TPDFDictionary;

var

EffectiveProps: TPDFDictionary;

CurrentNode: TPDFDictionary;

begin

EffectiveProps := TPDFDictionary.Create;

CurrentNode := PageObj;

// Walk up the tree collecting inherited properties

while CurrentNode <> nil do

begin

// Add properties not already set (inheritance chain)

if not EffectiveProps.HasKey('/MediaBox') and CurrentNode.HasKey('/MediaBox') then

EffectiveProps.SetValue('/MediaBox', CurrentNode.GetValue('/MediaBox'));

if not EffectiveProps.HasKey('/Resources') and CurrentNode.HasKey('/Resources') then

EffectiveProps.SetValue('/Resources', CurrentNode.GetValue('/Resources'));

// ... other inheritable properties

// Move to parent

if CurrentNode.HasKey('/Parent') then

CurrentNode := FindObjectByReference(CurrentNode.GetValue('/Parent'))

else

CurrentNode := nil;

end;

Result := EffectiveProps;

end;

Mistake #6: Assuming Count Values Are Accurate

Sometimes the /Count values of page tree nodes do not match the actual number of pages.

Problem:

Pages Root

/Count 5 <- Claims 5 pages

/Kids [A B C] <- But only 3 direct children

Node A: /Count 2, /Kids [Page1, Page2]

Node B: /Count 1, /Kids [Page3]

Node C: /Count 3, /Kids [Page4, Page5, Page6] <- 3 pages, not matching parent count

Defensive programming:

// HotPDF VCL Component code snippet

function CountActualPages(PagesNode: TPDFDictionary): Integer;

var

ActualCount, ClaimedCount: Integer;

KidsArray: TPDFArray;

i: Integer;

ChildObj: TPDFDictionary;

begin

ActualCount := 0;

KidsArray := PagesNode.GetValue('/Kids');

for i := 0 to KidsArray.Count - 1 do

begin

ChildObj := FindObjectByReference(KidsArray.GetReference(i));

if ChildObj.GetValue('/Type') = '/Page' then

Inc(ActualCount)

else if ChildObj.GetValue('/Type') = '/Pages' then

Inc(ActualCount, CountActualPages(ChildObj));

end;

// Verify against claimed count

ClaimedCount := PagesNode.GetValue('/Count');

if ClaimedCount <> ActualCount then

WriteLn('Warning: Count mismatch - claimed: ', ClaimedCount, ', actual: ', ActualCount);

Result := ActualCount;

end;
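The same check can be run over every intermediate node at once, which helps pinpoint where a /Count went wrong. This is an illustrative sketch over a dict-based object table (an assumption, as is the name check_counts); it expects an acyclic tree and collects mismatches instead of printing them.

```python
def check_counts(objects, node_num, mismatches):
    """Return the real number of leaf pages under node_num and record
    (object number, claimed /Count, actual count) for every bad node."""
    node = objects[node_num]
    if node["Type"] == "Page":
        return 1
    actual = sum(check_counts(objects, kid, mismatches) for kid in node["Kids"])
    if node.get("Count") != actual:
        mismatches.append((node_num, node.get("Count"), actual))
    return actual


# The problem case above: the root claims 5 pages, its subtrees hold 6.
objects = {
    1: {"Type": "Pages", "Count": 5, "Kids": [2, 3, 4]},
    2: {"Type": "Pages", "Count": 2, "Kids": [11, 12]},
    3: {"Type": "Pages", "Count": 1, "Kids": [13]},
    4: {"Type": "Pages", "Count": 3, "Kids": [14, 15, 16]},
}
for page_num in range(11, 17):
    objects[page_num] = {"Type": "Page"}

mismatches = []
total = check_counts(objects, 1, mismatches)
print(total, mismatches)  # prints 6 [(1, 5, 6)]
```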

How to Parse Pages Correctly

Step 1: Find the Document Root

// Find trailer and get Root reference

RootRef := GetTrailerRootReference();

RootObject := FindObject(RootRef);

Step 2: Navigate to the Page Tree

// Get Pages reference from Root catalog

PagesRef := RootObject.GetValue('/Pages');

PagesObject := FindObject(PagesRef);

Step 3: Process the Kids Array in Order

// Extract Kids array - this defines page order

KidsArray := PagesObject.GetValue('/Kids');

// Process each page in the order specified by Kids

for i := 0 to KidsArray.Count - 1 do

begin

PageRef := KidsArray[i];

PageObject := FindObject(PageRef);

// Now you have the actual page i+1

end;

Advanced Concepts

Nested Page Trees

Large documents may use nested page trees for better organization:

Root Pages

├── Chapter 1 Pages

│ ├── Page 1

│ ├── Page 2

│ └── Page 3

└── Chapter 2 Pages

├── Page 4

└── Page 5

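Page Inheritance

Pages can inherit attributes from their parent page tree nodes, for example:

MediaBox (page size)
CropBox (visible region)
Resources (fonts, images)
Rotate (rotation)

Practical Implementation Tips

1. Always follow the tree structure.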

// Wrong: Assumes sequential object order

PageObject := GetObject(PageNumber);

// Right: Follows Pages tree structure

PageObject := GetPageFromKidsArray(PageNumber - 1);

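2. Handle recursive page trees.

Some PDF files have several levels of page tree nodes. Your code should traverse the tree recursively:

procedure ProcessPageNode(Node: TPDFObject);
var
  KidsArray: TPDFArray;
  i: Integer;
begin
  if Node.GetValue('/Type') = '/Pages' then
  begin
    // Intermediate node - process Kids in array order
    KidsArray := Node.GetValue('/Kids');
    for i := 0 to KidsArray.Count - 1 do
      ProcessPageNode(FindObjectByReference(KidsArray.GetReference(i)));
  end
  else if Node.GetValue('/Type') = '/Page' then
  begin
    // Leaf node - actual page
    AddPageToArray(Node);
  end;
end;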

3. Verify the page count.

Always check that the /Count value in the Pages object matches the number of pages actually found:

ExpectedCount := PagesObject.GetValue('/Count');

ActualCount := CountPagesInTree(PagesObject);

if ExpectedCount <> ActualCount then

RaiseError('Page count mismatch');

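Debugging PDF Page Problems

Common Symptoms

Wrong pages extracted: usually indicates that the order of the Kids array is being ignored.
Missing pages: often occurs when nested page trees are not handled.
Duplicate pages: can occur when both intermediate and leaf nodes are processed.

Debugging Techniques

Log the page tree structure:

WriteLn('Pages tree Kids: [', KidsArrayToString(Kids), ']');

WriteLn('Processing page object: ', PageObjectNumber);

Verify page content: extract a small sample and check that it matches the expected content.
Use external tools: tools such as qpdf or pdftk can help analyze the PDF structure.

Best Practices

1. Build the right data structure.

Make your internal page array mirror the logical page order of the PDF.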

// Build PageArray following Kids order

SetLength(PageArray, PageCount);

for i := 0 to KidsArray.Count - 1 do

begin

PageRef := KidsArray[i];

PageArray[i] := FindObject(PageRef);

end;

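2. Separate parsing from processing.

Parse the entire page structure first, then perform your operations. Do not try to process pages while the document structure is still being parsed.

3. Handle edge cases.

Documents with no pages (0 pages).
Single-page documents.
Documents with mixed page orientations.
Documents with inherited attributes.

Advanced PDF Object Types

Understanding the PDF Object Hierarchy

Beyond basic page objects, a PDF contains numerous specialized object types that work together to make up the complete document.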

Document Catalog (Root)

├── Pages Tree

├── Outlines (Bookmarks)

├── Names Dictionary

├── Dests (Named Destinations)

├── ViewerPreferences

├── PageLabels

├── Metadata

├── StructTreeRoot (Tagged PDF)

├── MarkInfo

├── Lang

├── SpiderInfo

├── OutputIntents

├── PieceInfo

├── AcroForm (Interactive Forms)

├── Encrypt (Security)

└── Extensions

Content Stream Objects

Page content is stored in stream objects that contain drawing commands.

5 0 obj (Content Stream)

/Length 1274

/Filter /FlateDecode

stream

BT % Begin text

/F1 12 Tf % Set font (F1) and size (12)

100 700 Td % Move to position (100, 700)

(Hello World) Tj % Show text "Hello World"

ET % End text

q % Save graphics state

Q % Restore graphics state

endstream

endobj

Resource Objects

Resources define the fonts, images, and graphics states used by content streams.

6 0 obj (Resources)

/Font <<

/F1 7 0 R % Font resource

/F2 8 0 R

>>

/XObject <<

/Im1 9 0 R % Image resource

>>

/ExtGState <<

/GS1 10 0 R % Graphics state

>>

/ColorSpace <<

/CS1 11 0 R % Color space

>>

endobj

Font Objects

Fonts are complex objects with several subtypes.

7 0 obj (Type 1 Font)

/Type /Font

/Subtype /Type1

/BaseFont /Helvetica

/Encoding /WinAnsiEncoding

endobj

8 0 obj (TrueType Font)

/Type /Font

/Subtype /TrueType

/BaseFont /ArialMT

/FirstChar 32

/LastChar 126

/Widths [278 278 355 ...]

/FontDescriptor 12 0 R

endobj

Professional PDF Analysis Tools

Command-Line Tools

QPDF – an all-purpose tool for PDF:

# Show page tree structure and page order

qpdf --show-pages input.pdf

# Show detailed page information in JSON format

qpdf --json=latest --json-key=pages input.pdf

# Validate PDF structure

qpdf --check input.pdf

# Show cross-reference table

qpdf --show-xref input.pdf

# Show specific object (e.g., pages tree root)

qpdf --show-object=16,0 input.pdf

# Show encryption details

qpdf --show-encryption input.pdf

# Show filtered stream data

qpdf --show-object=5,0 --filtered-stream-data input.pdf

# Show complete document structure in JSON

qpdf --json input.pdf

CPDF – the Coherent PDF command-line tool:

# Get comprehensive PDF information in JSON format

cpdf -info-json input.pdf

# Get detailed page information with boxes and rotation

cpdf -page-info-json input.pdf

# List all fonts with encoding and type information

cpdf -list-fonts-json input.pdf

# List images with dimensions, color space, and compression

cpdf -list-images-json input.pdf

# View specific PDF objects (great for debugging)

cpdf -obj 16 input.pdf

# Output: <</Count 3/Kids[20 0 R 1 0 R 4 0 R]/Type/Pages>>

# Analyze document composition and size breakdown

cpdf -composition-json input.pdf

# Shows percentage of images, fonts, content streams, etc.

# List bookmarks in JSON format

cpdf -list-bookmarks-json input.pdf

# Export complete PDF structure as JSON for detailed analysis

cpdf -output-json input.pdf -o structure.json

PDFtk – the PDF toolkit:

# Dump document metadata

pdftk input.pdf dump_data

# Show bookmarks

pdftk input.pdf dump_data | grep -A 5 "Bookmark"

# Extract specific pages

pdftk input.pdf cat 1-3 output pages_1_to_3.pdf

# Rotate pages

pdftk input.pdf cat 1-endright output rotated.pdf

MuPDF tools:

# Show PDF structure

mutool show input.pdf

# Extract text with positioning

mutool draw -F txt input.pdf

# Convert to HTML (preserves structure)

mutool convert -F html -o output.html input.pdf

# Show object details

mutool show input.pdf 1

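Desktop Analysis Tools

PDF Explorer (commercial):

Visual tree view of the document structure
Live editing of object properties
Cross-reference validation
Stream decoding and viewing

PDF Debugger (Adobe):

Step-by-step PDF rendering
Object inspector with syntax highlighting
Content stream analysis
Error detection and reporting

Programming Libraries for Analysis

Python: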

import PyPDF2
import fitz  # PyMuPDF

# PyPDF2 analysis (PdfReader API; the older PdfFileReader, numPages,
# and getPage names were removed in PyPDF2 3.0)
with open('input.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)

    # Show page tree structure
    pages_obj = reader.trailer['/Root']['/Pages']
    print(f"Pages object: {pages_obj}")

    # Show each page's properties (reader.pages follows logical order)
    for i, page in enumerate(reader.pages):
        print(f"Page {i+1}: {page}")

# PyMuPDF detailed analysis
doc = fitz.open('input.pdf')
for page_num in range(doc.page_count):
    page = doc[page_num]

    # Read the page's content streams, concatenated into one bytes object
    contents = page.read_contents()
    print(f"Page {page_num + 1} contents: {len(contents)} bytes")

    # Get text with positioning
    blocks = page.get_text("dict")
    for block in blocks["blocks"]:
        if "lines" in block:
            for line in block["lines"]:
                for span in line["spans"]:
                    print(f"Text: '{span['text']}' at {span['bbox']}")

JavaScript (PDF.js):

// Load and analyze PDF

pdfjsLib.getDocument('input.pdf').promise.then(function(pdf) {

// Get page count

console.log('Page count:', pdf.numPages);

// Analyze each page

for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {

pdf.getPage(pageNum).then(function(page) {

// Get page annotations

page.getAnnotations().then(function(annotations) {

console.log(`Page ${pageNum} annotations:`, annotations);

});

// Get text content

page.getTextContent().then(function(textContent) {

console.log(`Page ${pageNum} text items:`, textContent.items.length);

});

});

}

});

Performance Considerations

Efficient Page Tree Traversal

When processing large documents, efficient traversal is critical.

// HotPDF Component code snippet

// Optimized page tree traversal with caching

type

TPageCache = class

private

FPageObjects: TDictionary<Integer, TPDFPageObject>;

FPageTree: TPDFPagesTree;

public

function GetPage(PageNumber: Integer): TPDFPageObject;

procedure PreloadPageRange(StartPage, EndPage: Integer);

procedure ClearCache;

end;

function TPageCache.GetPage(PageNumber: Integer): TPDFPageObject;

begin

// Check cache first

if FPageObjects.ContainsKey(PageNumber) then

Exit(FPageObjects[PageNumber]);

// Load on demand

Result := FPageTree.LoadPage(PageNumber);

FPageObjects.Add(PageNumber, Result);

end;

procedure TPageCache.PreloadPageRange(StartPage, EndPage: Integer);

var

I: Integer;

PageObj: TPDFPageObject;

begin

// Batch load for better performance

for I := StartPage to EndPage do

begin

if not FPageObjects.ContainsKey(I) then

begin

PageObj := FPageTree.LoadPage(I);

FPageObjects.Add(I, PageObj);

end;

end;

end;

Memory Management

Large PDF files require careful memory management.

// losLab HotPDF Component code snippet

// Memory-efficient PDF processing

type

TPDFProcessor = class

private

FMemoryLimit: Int64;

FCurrentMemoryUsage: Int64;

procedure CheckMemoryUsage;

procedure FlushCaches;

public

procedure ProcessPagesInBatches(PDF: TPDFDocument; BatchSize: Integer);

end;

procedure TPDFProcessor.ProcessPagesInBatches(PDF: TPDFDocument; BatchSize: Integer);

var

I, StartPage, EndPage: Integer;

PageCount: Integer;

Batch: TList<TPDFPageObject>;

begin

PageCount := PDF.GetPageCount;

StartPage := 1;

while StartPage <= PageCount do

begin

EndPage := Min(StartPage + BatchSize - 1, PageCount);

Batch := TList<TPDFPageObject>.Create;

try

// Load batch of pages

for I := StartPage to EndPage do

begin

Batch.Add(PDF.GetPage(I));

CheckMemoryUsage;

end;

// Process batch

ProcessPageBatch(Batch);

finally

// Clean up batch

Batch.Free;

FlushCaches;

end;

StartPage := EndPage + 1;

end;

end;

Lazy Loading Strategy

Implement lazy loading for large documents.

// Lazy-loaded page tree

type

TLazyPDFPage = class

private

FPageReference: TPDFReference;

FPageObject: TPDFPageObject;

FLoaded: Boolean;

function GetPageObject: TPDFPageObject;

public

constructor Create(PageRef: TPDFReference);

property PageObject: TPDFPageObject read GetPageObject;

property IsLoaded: Boolean read FLoaded;

procedure Unload; // Free memory when not needed

end;

function TLazyPDFPage.GetPageObject: TPDFPageObject;

begin

if not FLoaded then

begin

WriteLn('[DEBUG] Loading page from reference ', FPageReference.ObjectNumber);

FPageObject := LoadObjectFromReference(FPageReference);

FLoaded := True;

end;

Result := FPageObject;

end;

procedure TLazyPDFPage.Unload;

begin

if FLoaded then

begin

WriteLn('[DEBUG] Unloading page ', FPageReference.ObjectNumber);

FPageObject.Free;

FPageObject := nil;

FLoaded := False;

end;

end;

Error Handling and Validation

Robust PDF Parsing

Handle malformed or corrupted PDF files safely.

// losLab Software Development code snippet

// Defensive PDF parsing with error recovery

type

TPDFParseResult = (prSuccess, prWarning, prError, prCriticalError);

function ParsePDFWithRecovery(FileName: string): TPDFParseResult;

var

PDF: TPDFDocument;

ErrorCount: Integer;

WarningCount: Integer;

begin

Result := prSuccess;

ErrorCount := 0;

WarningCount := 0;

try

PDF := TPDFDocument.Create;

try

// Basic file validation

if not ValidatePDFHeader(FileName) then

begin

WriteLn('[ERROR] Invalid PDF header');

Inc(ErrorCount);

end;

// Load with error recovery

if not PDF.LoadFromFileWithRecovery(FileName) then

begin

WriteLn('[ERROR] Failed to load PDF structure');

Inc(ErrorCount);

end;

// Validate page tree

case ValidatePageTree(PDF) of

vtValid:

WriteLn('[INFO] Page tree is valid');

vtWarning:

begin

WriteLn('[WARN] Page tree has minor issues');

Inc(WarningCount);

end;

vtError:

begin

WriteLn('[ERROR] Page tree is corrupted');

Inc(ErrorCount);

end;

end;

// Validate cross-references

if not ValidateXRefTable(PDF) then

begin

WriteLn('[WARN] Cross-reference table has issues, attempting repair');

if RepairXRefTable(PDF) then

Inc(WarningCount)

else

Inc(ErrorCount);

end;

// Determine result based on error counts

if ErrorCount > 0 then

Result := prError

else if WarningCount > 0 then

Result := prWarning

else

Result := prSuccess;

finally

PDF.Free;

end;

except

on E: Exception do

begin

WriteLn('[CRITICAL] Exception during PDF parsing: ', E.Message);

Result := prCriticalError;

end;

end;

end;

Validation Checklist

Implement comprehensive validation.

// losLab Software code snippet

// PDF validation checklist source codes

type

TValidationCheck = record

Name: string;

Passed: Boolean;

Message: string;

end;

function ValidatePDFDocument(PDF: TPDFDocument): TArray<TValidationCheck>;

var

Checks: TArray<TValidationCheck>;

begin

SetLength(Checks, 10);

// Check 1: File header

Checks[0].Name := 'PDF Header';

Checks[0].Passed := ValidatePDFVersion(PDF.Version);

Checks[0].Message := 'PDF version: ' + PDF.Version;

// Check 2: Document catalog

Checks[1].Name := 'Document Catalog';

Checks[1].Passed := PDF.Catalog <> nil;

Checks[1].Message := 'Root catalog ' + IfThen(Checks[1].Passed, 'found', 'missing');

// Check 3: Page tree structure

Checks[2].Name := 'Page Tree';

Checks[2].Passed := ValidatePageTreeStructure(PDF);

Checks[2].Message := Format('Page tree contains %d pages', [PDF.PageCount]);

// Check 4: Cross-reference table

Checks[3].Name := 'Cross-Reference Table';

Checks[3].Passed := ValidateXRefConsistency(PDF);

Checks[3].Message := 'XRef table consistency check';

// Check 5: Object integrity

Checks[4].Name := 'Object Integrity';

Checks[4].Passed := ValidateObjectIntegrity(PDF);

Checks[4].Message := 'All referenced objects exist';

// Check 6: Page content streams

Checks[5].Name := 'Content Streams';

Checks[5].Passed := ValidateContentStreams(PDF);

Checks[5].Message := 'All pages have valid content';

// Check 7: Font resources

Checks[6].Name := 'Font Resources';

Checks[6].Passed := ValidateFontResources(PDF);

Checks[6].Message := 'Font resources are complete';

// Check 8: Image resources

Checks[7].Name := 'Image Resources';

Checks[7].Passed := ValidateImageResources(PDF);

Checks[7].Message := 'Image resources are accessible';

// Check 9: Encryption

Checks[8].Name := 'Encryption';

Checks[8].Passed := ValidateEncryption(PDF);

Checks[8].Message := 'Encryption settings are valid';

// Check 10: Metadata

Checks[9].Name := 'Metadata';

Checks[9].Passed := ValidateMetadata(PDF);

Checks[9].Message := 'Document metadata is well-formed';

Result := Checks;

end;

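Practical Verification: Analyzing a Real PDF

To verify the concepts presented in this article, we ran an actual analysis with qpdf on a problematic PDF file. The results clearly demonstrate the page-order problem.

Real qpdf Output Analysis

Command: qpdf --show-pages input-all.pdf

Result: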

page 1: 20 0 R

content: 192 0 R

page 2: 1 0 R

content: 190 0 R

page 3: 4 0 R

content: 188 0 R

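Analysis:

Logical page 1 → object 20 (highest number)
Logical page 2 → object 1 (lowest number)
Logical page 3 → object 4 (middle number)

This real example shows why parsing in object-number order fails: processing the objects numerically (1, 4, 20) yields page order (2, 3, 1) rather than the correct logical order (1, 2, 3).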
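This comparison can be automated when checking many files. The sketch below parses text in the shape quoted above, where each page line reads "page N: OBJ GEN R", into a page-to-object mapping. That line format is an assumption taken from the output shown in this article, and page_objects is a hypothetical helper; content lines are simply ignored.

```python
import re

# Matches lines such as "page 1: 20 0 R"; other lines are skipped.
_PAGE_LINE = re.compile(r"^page (\d+): (\d+) (\d+) R\s*$", re.MULTILINE)


def page_objects(show_pages_output: str) -> dict[int, int]:
    """Map logical page number -> page object number."""
    return {
        int(page): int(obj_num)
        for page, obj_num, _gen in _PAGE_LINE.findall(show_pages_output)
    }


output = """\
page 1: 20 0 R
content: 192 0 R
page 2: 1 0 R
content: 190 0 R
page 3: 4 0 R
content: 188 0 R
"""

mapping = page_objects(output)
print(mapping)                            # prints {1: 20, 2: 1, 3: 4}
print(sorted(mapping, key=mapping.get))   # prints [2, 3, 1]
```

The second line printed is the page order a parser would produce if it walked objects by ascending object number.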

Verification Commands

The following qpdf commands successfully verified the document structure:

# Show page structure - WORKS

qpdf --show-pages input-all.pdf

# Show detailed page info in JSON - WORKS

qpdf --json=latest --json-key=pages input-all.pdf

# Validate PDF structure - WORKS

qpdf --check input-all.pdf

# Output: "No syntax or stream encoding errors found"

# Show cross-reference table - WORKS

qpdf --show-xref input-all.pdf

# Show specific object (e.g., pages tree root)

qpdf --json=latest --json-key=qpdf input-all.pdf | findstr "Pages"

# Output: "/Pages": "16 0 R"

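Real-World Impact

This analysis validated the debugging approach described in our companion article. The fix, implementing ReorderPageArrByPagesTree so that pages are processed in logical order, directly addresses the problem described above.

Conclusion

Understanding the PDF page tree is essential for reliable PDF manipulation, but it is only the first step toward mastering PDF document structure. This comprehensive analysis covered the following.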

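Technical Mastery Points

Document architecture: a PDF is a complex object database with an intricate reference system.
Page tree traversal: logical order (the Kids array) and physical order must be kept carefully apart.
Object relationships: understanding how objects reference one another prevents parsing errors.
Inheritance patterns: page attributes are inherited from parent nodes in the tree hierarchy.
Error recovery: robust parsing handles malformed documents gracefully.

Advanced Concepts Covered

Nested structures: real PDF files often have multi-level page trees.
Object types: beyond pages, PDF files contain fonts, images, forms, and metadata.
Performance optimization: large documents need lazy loading and memory management.
Validation strategies: comprehensive checks prevent subtle bugs.
Tool integration: professional tools enhance debugging and analysis.

Development Best Practices

Follow the specification: ISO 32000 defines the standard structure of PDF.
Program defensively: always validate your assumptions about document structure.
Use the right tools: take advantage of existing PDF analysis tools for debugging.
Test comprehensively: different PDF generators produce different structures.
Cache intelligently: balance memory usage against performance requirements.

Real-World Applications

The concepts described in this guide apply to:

PDF viewers: correct page order and rendering
Document processors: page extraction, merging, and manipulation
Accessibility tools: understanding structure for screen readers
Archival systems: long-term document preservation
Security analysis: understanding structure for forensic analysis

Key Takeaways

PDF page order may seem like a minor technical detail, but getting it wrong produces subtle bugs that are hard to track down. The basic principle is simple: always respect the logical structure defined by the PDF specification, not the physical arrangement of objects in the file.

By understanding these concepts and implementing them correctly, you can build PDF processing applications that handle the complexity of real-world documents. Whether you are building a simple page extractor or a sophisticated document management system, this foundation will serve you well.

Remember: a PDF is a structured document with specific rules. Following those rules in your code gives you better compatibility, fewer user complaints, and more robust applications. Investing in understanding PDF structure reduces debugging time and improves user satisfaction.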