Understanding PDF File Structure: A Technical Overview

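A PDF is a structured file: beyond the content visible on the page, it contains objects, cross-reference information, and a trailer. Understanding this internal structure makes it easier to isolate the problems that commonly arise when implementing PDF generation, parsing, repair, and optimization.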

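The four main sections

Header: Gives the PDF version and signals that the file contains binary data.
Body: Holds the objects, such as the catalog, pages, fonts, images, and content streams.
Cross-reference information: Records the byte offset of each object, which enables random access.
Trailer: Identifies the root object and the information dictionary, and is followed by the byte offset at which the xref section starts.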
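
The order of the four sections can be checked mechanically. The following is a minimal sketch, assuming a file with a classic xref table (not a cross-reference stream); the function name `outline_sections` and the hand-built sample file are illustrative assumptions, not part of any library.

```python
import re

def outline_sections(data: bytes) -> dict:
    """Return the PDF version and the byte position of each structural keyword."""
    header = re.match(rb"%PDF-(\d+)\.(\d+)", data)
    if header is None:
        raise ValueError("file does not start with a %PDF-x.y header")
    result = {"version": (int(header.group(1)), int(header.group(2)))}
    # Walk backwards from the end of the file, so that each keyword found is
    # the last one before the keyword that follows it. This also keeps "xref"
    # from being confused with the tail of "startxref".
    end = len(data)
    for keyword in (b"%%EOF", b"startxref", b"trailer", b"xref"):
        end = data.rfind(keyword, 0, end)
        if end == -1:
            raise ValueError(f"missing {keyword.decode()} keyword")
        result[keyword.decode()] = end
    return result

# A tiny hand-built file, used only for illustration.
body = b"%PDF-1.0\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Root 1 0 R /Size 2 >>\n"
    + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"
)
info = outline_sections(sample)
print(info)
```

Searching from the end reflects how readers typically open a PDF: the end of the file is located first, and everything else is reached from there.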

Basic structure example 1

A complete minimal file containing all four sections. The % comments mark where each section begins.

%PDF-1.0 % Header starts here
%âãÏÓ
1 0 obj % Body starts here
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj
2 0 obj
<<
/Rotate 0
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
/Type /Page
>>
endobj
3 0 obj
<<
/Font
  <<
  /F0
    <<
    /BaseFont /Times-Italic
    /Subtype /Type1
    /Type /Font
    >>
  >>
>>
endobj
4 0 obj
<<
/Length 65
>>
stream
1. 0. 0. 1. 50. 700. cm
BT
/F0 36. Tf
(Hello, World!) Tj
ET
endstream
endobj
5 0 obj
<<
/Pages 1 0 R
/Type /Catalog
>>
endobj
xref % Cross-reference table starts here
0 6
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
trailer % Trailer starts here
<<
/Root 5 0 R
/Size 6
>>
startxref
459
%%EOF

Basic structure example 2

The header: a version line, followed by a comment line of bytes above 127 that signals binary content.

%PDF-1.0
%âãÏÓ

Basic structure example 3

The general form of an indirect object definition.

[object_number] [generation_number] obj
[object_content]
endobj

Basic structure example 4

A single object from the body: the root of the page tree.

1 0 obj
<<
/Kids [2 0 R]
/Count 1
/Type /Pages
>>
endobj

Basic structure example 5

The cross-reference table, with one fixed-width entry per object.

xref
0 6 % Six entries starting at object 0
0000000000 65535 f % Special entry for free objects
0000000015 00000 n % Object 1 at byte offset 15
0000000074 00000 n % Object 2 at byte offset 74
0000000192 00000 n % Object 3 at byte offset 192
0000000291 00000 n % Object 4 at byte offset 291
0000000409 00000 n % Object 5 at byte offset 409
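
A table like this can be read into a lookup from object number to byte offset. The sketch below is illustrative only: `parse_xref_section` is a hypothetical helper, it works on text rather than raw bytes, and it strips the % annotations used in these listings, which a real xref table would not contain.

```python
def parse_xref_section(text: str) -> dict:
    """Parse a classic xref section into {object_number: (offset, generation, in_use)}."""
    # Drop the explanatory % annotations and any blank lines.
    lines = [line.split("%")[0].strip() for line in text.strip().splitlines()]
    lines = [line for line in lines if line]
    if not lines or lines[0] != "xref":
        raise ValueError("section must start with the xref keyword")
    entries = {}
    i = 1
    while i < len(lines):
        # Subsection header: first object number and number of entries.
        first, count = (int(x) for x in lines[i].split())
        for n in range(count):
            offset, generation, kind = lines[i + 1 + n].split()
            # "n" marks an object in use, "f" marks a free entry.
            entries[first + n] = (int(offset), int(generation), kind == "n")
        i += 1 + count
    return entries

table = """xref
0 6 % Six entries starting at object 0
0000000000 65535 f
0000000015 00000 n
0000000074 00000 n
0000000192 00000 n
0000000291 00000 n
0000000409 00000 n
"""
entries = parse_xref_section(table)
print(entries[1])  # (15, 0, True)
```

A production parser would read the fixed-width entries directly from the file bytes, but the subsection logic is the same.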

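Objects and references

The basic PDF object types are numbers, names, strings, arrays, dictionaries, and streams, along with booleans and null. In a complex document, these objects are linked by indirect references to form the page tree and the resource structure.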

Object example 1

The trailer dictionary and the end of the file. /Root is an indirect reference to the document catalog.

trailer
<<
/Root 5 0 R % Document catalog reference
/Size 6 % Number of xref entries
>>
startxref
459 % Byte offset of xref table
%%EOF % End-of-file marker
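
Reading a file usually starts from these last few lines. As a rough sketch, assuming a classic trailer and using regular expressions as a shortcut in place of a real tokenizer, the entries can be extracted as follows (`read_trailer` is a hypothetical name):

```python
import re

def read_trailer(data: bytes) -> dict:
    """Extract the xref offset and the /Root and /Size entries from the last trailer."""
    start = data.rfind(b"startxref")
    trailer_pos = data.rfind(b"trailer", 0, start)
    if start == -1 or trailer_pos == -1:
        raise ValueError("no classic trailer found")
    # The number after the startxref keyword is the byte offset of the xref section.
    xref_offset = int(data[start + len(b"startxref"):].split()[0])
    trailer = data[trailer_pos:start]
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", trailer)
    size = re.search(rb"/Size\s+(\d+)", trailer)
    if root is None or size is None:
        raise ValueError("trailer lacks /Root or /Size")
    return {
        "xref_offset": xref_offset,
        "root": (int(root.group(1)), int(root.group(2))),  # (object, generation)
        "size": int(size.group(1)),
    }

tail = b"trailer\n<<\n/Root 5 0 R\n/Size 6\n>>\nstartxref\n459\n%%EOF\n"
trailer_info = read_trailer(tail)
print(trailer_info)  # {'xref_offset': 459, 'root': (5, 0), 'size': 6}
```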

Object example 2

Integers and real numbers.

% Integer examples
0 +1 -1 63
% Real number examples
0.0 0. .0 -0.004 65.4

Object example 3

Literal strings, written between parentheses.

% Simple string
(Hello, World!)
% String with escaped characters
(Some \\ escaped \(characters\))
% String with balanced parentheses (no escaping needed)
(Red (Rouge))
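
The escaping and balancing rules above can be sketched as a small decoder. This is an illustration only: `decode_literal_string` is a hypothetical helper, it maps each byte to one character, and it omits the backslash-newline line continuation.

```python
def decode_literal_string(src: str) -> str:
    """Decode a PDF literal string such as '(Red (Rouge))' into its text."""
    if len(src) < 2 or not (src.startswith("(") and src.endswith(")")):
        raise ValueError("literal strings are delimited by parentheses")
    escapes = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
               "(": "(", ")": ")", "\\": "\\"}
    out = []
    depth = 0
    i, end = 1, len(src) - 1
    while i < end:
        ch = src[i]
        if ch == "\\":
            i += 1
            if i >= end:
                raise ValueError("closing parenthesis is escaped")
            if src[i] in "01234567":
                # Octal escape of one to three digits.
                j = i
                while j < min(i + 3, end) and src[j] in "01234567":
                    j += 1
                out.append(chr(int(src[i:j], 8) & 0xFF))
                i = j
                continue
            # An unknown escape drops the backslash and keeps the character.
            out.append(escapes.get(src[i], src[i]))
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    raise ValueError("unbalanced parentheses")
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    return "".join(out)

print(decode_literal_string("(Red (Rouge))"))  # Red (Rouge)
print(decode_literal_string(r"(Some \\ escaped \(characters\))"))
```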

Object example 4

Hexadecimal strings, written between angle brackets.

<4F6Eff00> % Bytes 0x4F, 0x6E, 0xFF, 0x00
<48656C6C6F> % "Hello" in ASCII hex
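
Decoding a hexadecimal string is a direct mapping from digit pairs to bytes. A minimal sketch follows (the name `decode_hex_string` is an assumption); it ignores whitespace between digits and treats a missing final digit as 0.

```python
def decode_hex_string(src: str) -> bytes:
    """Decode a PDF hexadecimal string such as '<48656C6C6F>' into bytes."""
    if len(src) < 2 or not (src.startswith("<") and src.endswith(">")):
        raise ValueError("hexadecimal strings are delimited by angle brackets")
    digits = "".join(src[1:-1].split())  # whitespace between digits is ignored
    if len(digits) % 2:
        digits += "0"  # an odd number of digits implies a trailing 0
    return bytes.fromhex(digits)

print(decode_hex_string("<48656C6C6F>"))  # b'Hello'
print(decode_hex_string("<4F6Eff00>"))    # b'On\xff\x00'
```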

Object example 5

Names, introduced by a slash.

/French % Simple name
/ % Valid name (just the slash)
/Websafe#20Dark#20Green % Name with encoded spaces (#20 = space)
/A#42 % Name with encoded character (#42 = 'B')
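
The #xx encoding shown above can be reversed with a single substitution. This is a small sketch with a hypothetical `decode_name` helper; it assumes the token has already been isolated from the surrounding syntax.

```python
import re

def decode_name(token: str) -> str:
    """Decode a PDF name token, replacing each #xx sequence with its character."""
    if not token.startswith("/"):
        raise ValueError("names start with a slash")
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )

for token in ("/French", "/", "/Websafe#20Dark#20Green", "/A#42"):
    print(repr(decode_name(token)))
```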

Object example 6

Boolean values and the null object.

true % Boolean true
false % Boolean false
null % Null object

Object example 7

Arrays, which may mix types and nest.

[0 0 400 500] % Four integers (typical rectangle)
[/Green /Blue [/Red /Yellow]] % Mixed types with nested array
[1 0 R 2 0 R 3 0 R] % Array of indirect references

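Streams, filters, and updates

Page content and images are stored as streams, which may be compressed with filters such as FlateDecode. A PDF can also be changed by incremental update, which appends the changes to the end of the file and leaves the existing content in place.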
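
To make the append-only nature of an incremental update concrete, here is a hedged sketch that adds a new version of one object to the end of a file. The function `append_update`, its parameters, and the stand-in original file are all assumptions for illustration; a real writer would take /Size, /Root, and the previous xref offset from the existing trailer.

```python
def append_update(original: bytes, obj_num: int, body: bytes,
                  size: int, root: str, prev_xref: int) -> bytes:
    """Append a new version of one object as an incremental update."""
    out = bytearray(original)  # the existing bytes are never modified
    if not out.endswith(b"\n"):
        out += b"\n"
    obj_offset = len(out)
    out += f"{obj_num} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_offset = len(out)
    # One xref subsection holding a single 20-byte entry for the new object.
    out += f"xref\n{obj_num} 1\n{obj_offset:010d} 00000 n \n".encode("ascii")
    # /Prev chains back to the previous xref section, so objects that were
    # not replaced remain reachable.
    out += f"trailer\n<< /Size {size} /Root {root} /Prev {prev_xref} >>\n".encode("ascii")
    out += f"startxref\n{xref_offset}\n%%EOF\n".encode("ascii")
    return bytes(out)

# Stand-in for an existing file; 123 is a made-up previous xref offset.
original = b"%PDF-1.4\n% existing objects, xref and trailer\nstartxref\n123\n%%EOF\n"
updated = append_update(original, 4, b"<< /Length 0 >>\nstream\n\nendstream",
                        size=6, root="5 0 R", prev_xref=123)
new_xref = int(updated.rsplit(b"startxref", 1)[1].split()[0])
print(updated[new_xref:new_xref + 4])  # b'xref'
```

Because the original bytes are kept as a prefix, a reader that follows the last startxref sees the new object, while the earlier revision can still be recovered by truncating the file.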

Streams and updates example 1

Dictionaries, which map names to objects. Every stream begins with one.

<</One 1 /Two 2 /Three 3>> % Simple mappings
<< % Multi-line dictionary
/Type /Page
/Parent 1 0 R
/Resources 3 0 R
/MediaBox [0 0 612 792]
/Contents [4 0 R]
>>

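Streams and updates example 2

A stream object: a dictionary giving the length and an optional filter, followed by the data between stream and endstream. The data is shown in readable form here; with /FlateDecode it would be stored compressed.

4 0 obj
<<
/Length 65 % Stream length in bytes
/Filter /FlateDecode % Optional compression filter
>>
stream
1. 0. 0. 1. 50. 700. cm % Binary or text data
BT
/F0 36. Tf
(Hello, World!) Tj
ET
endstream
endobj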
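
The relationship between /Length, the filter, and the stored bytes can be sketched with Python's `zlib` module, which implements the deflate format that FlateDecode uses. The helpers `make_flate_stream` and `read_flate_stream` are hypothetical names, and the reader assumes a direct /Length value and a single filter.

```python
import re
import zlib

def make_flate_stream(obj_num: int, content: bytes) -> bytes:
    """Build a stream object whose data is deflate-compressed."""
    compressed = zlib.compress(content)
    # /Length counts the stored (compressed) bytes, not the original content.
    head = (f"{obj_num} 0 obj\n<< /Length {len(compressed)} "
            f"/Filter /FlateDecode >>\nstream\n").encode("ascii")
    return head + compressed + b"\nendstream\nendobj\n"

def read_flate_stream(obj: bytes) -> bytes:
    """Read the data back, checking /Length against the endstream keyword."""
    length = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    raw = obj[start:start + length]
    # After exactly /Length bytes, only an optional end-of-line may precede endstream.
    if re.match(rb"\r?\n?endstream", obj[start + length:]) is None:
        raise ValueError("/Length does not match the stream data")
    return zlib.decompress(raw)

content = b"BT /F0 36. Tf (Hello, World!) Tj ET"
stream_obj = make_flate_stream(4, content)
print(read_flate_stream(stream_obj) == content)  # True
```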

Streams and updates example 3

Indirect references, alone and inside a dictionary.

6 0 R % Reference to object 6, generation 0
<</Resources 10 0 R /Contents [4 0 R]>> % Dictionary using references

Streams and updates example 4

Filter arrays. Filters are listed in the order in which they are applied when decoding.

/Filter [/ASCII85Decode /DCTDecode] % JPEG data then ASCII85 encoded
/Filter [/ASCIIHexDecode /FlateDecode] % Deflate compression then hex encoding
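
Decoding a filter array means applying each decoder in listed order. The sketch below covers only the second pair above, ASCIIHexDecode followed by FlateDecode; the `DECODERS` table and function names are assumptions, and other filters such as DCTDecode are deliberately left out.

```python
import binascii
import zlib

def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digit pairs terminated by '>'."""
    digits = b"".join(data.split())    # whitespace is ignored
    digits = digits.split(b">", 1)[0]  # '>' marks the end of the data
    if len(digits) % 2:
        digits += b"0"                 # a missing final digit is taken as 0
    return binascii.unhexlify(digits)

# Only the two filters used in this sketch are registered.
DECODERS = {
    "ASCIIHexDecode": ascii_hex_decode,
    "FlateDecode": zlib.decompress,
}

def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply each filter in the order listed in the /Filter array."""
    for name in filters:
        if name not in DECODERS:
            raise ValueError(f"unsupported filter: {name}")
        data = DECODERS[name](data)
    return data

original_data = b"(Hello, World!) Tj"
# Encoding runs in the opposite order: deflate first, then hex.
encoded = binascii.hexlify(zlib.compress(original_data)) + b">"
print(decode_stream(encoded, ["ASCIIHexDecode", "FlateDecode"]))
```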

Streams and updates example 5

The linearization dictionary at the start of a linearized file.

%PDF-1.4
%âãÏÓ
4 0 obj % Linearization dictionary
<< /E 200967 % End of first page
/H [ 667 140 ] % Hint stream location and length
/L 201431 % File length
/Linearized 1 % Linearization flag
/N 1 % Number of pages
/O 7 % First page object number
/T 201230 % Traditional xref table offset
>>
endobj

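Implementation checklist

Verify that startxref and the xref section are consistent.
Check that every referenced object exists, with no missing objects and no cycles in hierarchies such as the page tree.
Compare each stream's /Length with the actual length of its data.
When handling incremental updates, select the most recent version of each object.
In a linearized PDF, do not break the layout rules that support web viewing.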
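
The first two checks can be approximated with a quick scan. This is a rough sketch, not a validator: `check_pdf` is a hypothetical function, it assumes a classic xref table, and matching raw bytes with regular expressions can misfire on text inside strings or streams.

```python
import re

def check_pdf(data: bytes) -> list:
    """Return a list of problems found by two simple consistency checks."""
    problems = []
    # Check 1: the offset after startxref must land on the xref keyword.
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if m is None:
        problems.append("no startxref/%%EOF at end of file")
    elif not data[int(m.group(1)):].startswith(b"xref"):
        problems.append("startxref does not point at an xref keyword")
    # Check 2: every "N G R" reference needs a matching "N G obj" definition.
    defined = {int(n) for n in re.findall(rb"(?m)^(\d+)\s+\d+\s+obj\b", data)}
    referenced = {int(n) for n in re.findall(rb"(\d+)\s+\d+\s+R\b", data)}
    for num in sorted(referenced - defined):
        problems.append(f"object {num} is referenced but never defined")
    return problems

# Hand-built sample in which the catalog points at a missing object 2.
body = b"%PDF-1.0\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Root 1 0 R /Size 2 >>\n"
    + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"
)
print(check_pdf(sample))  # ['object 2 is referenced but never defined']
```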

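Implementation note 1

A variant type that models PDF objects in memory, written as an OCaml type. Lists are used for arrays and dictionaries so that values can be built with list literals.

type pdfobject =
  | Null
  | Boolean of bool
  | Integer of int
  | Real of float
  | String of string
  | Name of string
  | Array of pdfobject list
  | Dictionary of (string * pdfobject) list (* key/value pairs *)
  | Stream of pdfobject * bytes (* stream dictionary and raw data *)
  | Indirect of int (* object number of the target *)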

Implementation note 2

The /Pages dictionary from the basic structure examples, expressed as a value of the pdfobject type.

Dictionary [
  ("Kids", Array [Indirect 2]);
  ("Count", Integer 1);
  ("Type", Name "Pages")
]
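
Going the other way, an in-memory value can be written back out as PDF syntax. The following Python sketch mirrors the structure above with assumed stand-ins: `Name` and `Indirect` are hypothetical helper classes, dictionaries and lists are plain Python containers, and strings and streams are omitted to keep it short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Name:
    value: str

@dataclass(frozen=True)
class Indirect:
    number: int

def serialize(obj) -> str:
    """Write a nested Python value as PDF syntax."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # test bool before int: bool is a subclass of int
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, float):
        # Fixed-point only; PDF has no exponent notation for reals.
        return f"{obj:.6f}".rstrip("0")
    if isinstance(obj, Name):
        return "/" + obj.value
    if isinstance(obj, Indirect):
        return f"{obj.number} 0 R"  # generation 0 is assumed
    if isinstance(obj, list):
        return "[" + " ".join(serialize(x) for x in obj) + "]"
    if isinstance(obj, dict):
        inner = " ".join(f"/{k} {serialize(v)}" for k, v in obj.items())
        return f"<< {inner} >>"
    raise TypeError(f"unsupported object: {obj!r}")

pages = {"Kids": [Indirect(2)], "Count": 1, "Type": Name("Pages")}
print(serialize(pages))  # << /Kids [2 0 R] /Count 1 /Type /Pages >>
```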

Implementation note 3

Command-line tools that are useful for inspecting, repairing, and converting PDF files.

# Linearize a PDF for web viewing (pdfopt is a script from older Ghostscript releases)
pdfopt input.pdf output.pdf
# Decompress streams for manual inspection
pdftk input.pdf output decompressed.pdf uncompress
# Extract and analyze PDF structure
pdf-parser --stats document.pdf
# Repair corrupted PDF files
pdftk broken.pdf output repaired.pdf
# Extract specific pages
pdftk document.pdf cat 1-3 output pages1-3.pdf
# Get comprehensive PDF information
pdfinfo -meta -struct document.pdf
# Convert PDF to PostScript for analysis
pdftops document.pdf document.ps

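Summary

PDF file structure is strict, but at its core it is the relationship between the header, body, cross-reference information, and trailer. With that relationship in hand, you can work with PDFs that not only open in a viewer but are also repairable and hold up in long-term use.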