Extracting Text from PDF Documents with PDFium VCL in Delphi

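Text extraction is one of the most common PDF processing tasks. Whether you are building a document search engine, a data mining application, or a content management system, the ability to extract text from PDF files is essential. This tutorial walks through the Extract Text demo, which shows how to extract text content from PDF documents using PDFium VCL.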

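Overview

The Extract Text demo shows how to extract all text content from a PDF document and save it to a text file. It supports page range selection, paragraph preservation, and proper handling of special characters.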

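Key Features

Whole-document extraction – extract text from all pages in one pass
Page range selection – extract text from specific pages only
Paragraph detection – preserve paragraph structure based on character positions
Special character handling – optionally remove NUL characters from the output
Page separators – optionally add blank lines between pages
Progress tracking – a visual progress bar and detailed logging
UTF-8 output – encode text output correctly for international documents
Character-level access – access individual characters for advanced processing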

PDFium DLL Requirements

Before running a PDFium VCL application, make sure the PDFium DLL files are installed:

pdfium32.dll / pdfium64.dll – standard version (about 5-6 MB)
pdfium32v8.dll / pdfium64v8.dll – with the V8 JavaScript engine (about 23-27 MB)

Installation: Run PDFiumVCL\DLLs\CopyDlls.bat with administrator privileges to copy the DLL files to the Windows system directory automatically.
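
When diagnosing a missing-DLL problem, it can help to know which file name the running executable would need. The following is a minimal sketch, assuming that the "32" and "64" in the file names above correspond to 32-bit and 64-bit builds. ExpectedPdfiumDll is a hypothetical helper written for this illustration; it is not part of PDFium VCL, and the component itself decides which DLL it actually loads.

```pascal
program PdfiumDllNameSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of PDFium VCL): builds the DLL file name
// that matches the bitness of the running executable.
function ExpectedPdfiumDll(UseV8: Boolean): string;
var
  Bits: string;
begin
  // A pointer is 8 bytes wide in a 64-bit build and 4 bytes in a 32-bit build
  if SizeOf(Pointer) = 8 then
    Bits := '64'
  else
    Bits := '32';

  if UseV8 then
    Result := 'pdfium' + Bits + 'v8.dll'
  else
    Result := 'pdfium' + Bits + '.dll';
end;

begin
  Writeln('Standard build: ', ExpectedPdfiumDll(False));
  Writeln('V8 build:       ', ExpectedPdfiumDll(True));
end.
```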

Basic Text Extraction

The simplest way to extract text from a PDF page:

procedure ExtractSimpleText;
var
  Pdf: TPdf;
  PageText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;

    // Extract text from page 1
    Pdf.PageNumber := 1;
    PageText := Pdf.Text;

    // Use the extracted text (Memo1 is a TMemo on the calling form)
    Memo1.Lines.Text := PageText;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Extracting from All Pages

Process every page in turn to extract the text of the whole document:

procedure TFormMain.ButtonExtractClick(Sender: TObject);
var
  I, StartPage, EndPage: Integer;
  PageText: string;
  FileStream: TFileStream;
  Text: UTF8String;
begin
  Pdf.FileName := EditPdfFile.Text;
  Pdf.PageNumber := 0;
  Pdf.Active := True;
  try
    // Determine page range
    if RadioButtonAllPages.Checked then
    begin
      StartPage := 1;
      EndPage := Pdf.PageCount;
    end
    else
    begin
      StartPage := StrToInt(EditFromPage.Text);
      EndPage := StrToInt(EditToPage.Text);
    end;

    // Size the progress bar to the number of pages being processed
    ProgressBar.Max := EndPage - StartPage + 1;

    // Create output file
    FileStream := TFileStream.Create(EditOutputFile.Text, fmCreate);
    try
      for I := StartPage to EndPage do
      begin
        Pdf.PageNumber := I;
        PageText := Pdf.Text;

        // Convert to UTF-8 and write
        Text := UTF8Encode(PageText);
        if Length(Text) > 0 then
          FileStream.WriteBuffer(Text[1], Length(Text));

        // Add page separator if enabled
        if CheckBoxPageSeparator.Checked and (I < EndPage) then
        begin
          Text := UTF8Encode(#13#10#13#10#13#10);
          FileStream.WriteBuffer(Text[1], Length(Text));
        end;

        ProgressBar.Position := I - StartPage + 1;
        Application.ProcessMessages;
      end;
    finally
      FileStream.Free;
    end;
  finally
    Pdf.Active := False;
  end;
end;
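
The handler above reads the page range with StrToInt, which raises an exception on non-numeric input and does not check the values against the page count. One possible way to validate the fields first is sketched below. TryGetPageRange is a hypothetical helper introduced for this illustration, and the sketch assumes page numbers are 1-based, as in the loop above.

```pascal
program PageRangeSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: parses the "from" and "to" fields and accepts the
// range only if 1 <= StartPage <= EndPage <= PageCount.
function TryGetPageRange(const FromText, ToText: string; PageCount: Integer;
  out StartPage, EndPage: Integer): Boolean;
begin
  Result := TryStrToInt(Trim(FromText), StartPage)
    and TryStrToInt(Trim(ToText), EndPage)
    and (StartPage >= 1)
    and (EndPage >= StartPage)
    and (EndPage <= PageCount);
end;

var
  FirstPage, LastPage: Integer;
begin
  if TryGetPageRange('2', '5', 10, FirstPage, LastPage) then
    Writeln('Valid range: ', FirstPage, '..', LastPage);

  if not TryGetPageRange('7', 'abc', 10, FirstPage, LastPage) then
    Writeln('Rejected non-numeric input');

  if not TryGetPageRange('8', '12', 10, FirstPage, LastPage) then
    Writeln('Rejected range beyond the last page');
end.
```

In the click handler, a False result could be reported to the user before the output file is created.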

Paragraph-Preserving Text Extraction

For documents where paragraph structure matters, use character position analysis:

function ExtractTextWithParagraphs(Pdf: TPdf): string;
var
  CharIndex: Integer;
  CurrentChar: WideChar;
  CurrentY, PrevY: Double;
  LineHeight, YGap: Double;
  ResultText, LineBuffer: string;
  MinLineHeight: Double;
begin
  ResultText := '';
  LineBuffer := '';
  PrevY := -1;
  MinLineHeight := 999999;

  // First pass: estimate the line height as the smallest non-zero
  // vertical gap between consecutive characters
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if (YGap > 0) and (YGap < MinLineHeight) then
        MinLineHeight := YGap;
    end;
    PrevY := CurrentY;
  end;

  LineHeight := MinLineHeight;
  // Fall back to a default when no vertical gap was found (e.g. a single line)
  if (LineHeight <= 0) or (LineHeight >= 999999) then
    LineHeight := 12;

  // Second pass: build text with paragraph detection
  PrevY := -1;
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentChar := Pdf.Character[CharIndex];
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;

    // Skip NUL characters
    if Ord(CurrentChar) = 0 then
      Continue;

    // Check for line break based on Y position change.
    // The 1.2 and 2.5 factors are heuristics: a gap equal to LineHeight
    // does not exceed the 1.2 factor, so lower it if ordinary line
    // changes are being merged in your documents.
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if YGap > LineHeight * 1.2 then
      begin
        // Add current line to result
        if LineBuffer <> '' then
        begin
          ResultText := ResultText + LineBuffer + #13#10;
          LineBuffer := '';
        end;

        // Check if this is a paragraph break (larger gap)
        if YGap > LineHeight * 2.5 then
          ResultText := ResultText + #13#10; // Extra line for paragraph
      end;
    end;

    LineBuffer := LineBuffer + CurrentChar;
    PrevY := CurrentY;
  end;

  // Add final line
  if LineBuffer <> '' then
    ResultText := ResultText + LineBuffer;

  Result := ResultText;
end;
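
To see how the two thresholds behave in isolation, the sketch below applies the same 1.2 and 2.5 factors to a handful of synthetic baseline positions. ClassifyGap, TBreakKind, and the sample values are all hypothetical and exist only for this illustration; no PDF is read, and the line height is supplied directly rather than estimated.

```pascal
program ParagraphGapSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TBreakKind = (bkNone, bkLine, bkParagraph);

// Hypothetical helper: classifies a vertical gap between two consecutive
// characters using the same factors as ExtractTextWithParagraphs.
function ClassifyGap(YGap, LineHeight: Double): TBreakKind;
begin
  if YGap > LineHeight * 2.5 then
    Result := bkParagraph
  else if YGap > LineHeight * 1.2 then
    Result := bkLine
  else
    Result := bkNone;
end;

const
  // Synthetic baseline Y positions of five consecutive characters
  Baselines: array[0..4] of Double = (700, 700, 684, 668, 620);
  SampleLineHeight = 12.0;
  Names: array[TBreakKind] of string =
    ('same line', 'line break', 'paragraph break');
var
  I: Integer;
  Kind: TBreakKind;
begin
  for I := 1 to High(Baselines) do
  begin
    Kind := ClassifyGap(Abs(Baselines[I] - Baselines[I - 1]), SampleLineHeight);
    Writeln(Format('%.0f -> %.0f: %s',
      [Baselines[I - 1], Baselines[I], Names[Kind]]));
  end;
end.
```

With a line height of 12, a gap of 16 counts as a line break and a gap of 48 as a paragraph break. Real documents vary, so both factors are worth checking against representative files.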

Cleaning Extracted Text

Strip NUL characters and encode the result as UTF-8:

function CleanAndFormatText(const RawText: string): UTF8String;
var
  I: Integer;
  CleanText: string;
begin
  CleanText := '';
  for I := 1 to Length(RawText) do
  begin
    // Skip NUL characters but keep all other characters
    if Ord(RawText[I]) <> 0 then
      CleanText := CleanText + RawText[I];
  end;
  Result := UTF8Encode(CleanText);
end;
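
Appending one character at a time can trigger repeated reallocations on long pages. As an alternative, the following sketch allocates the result once and copies the kept characters in place. StripNulChars is a hypothetical name, and whether the difference matters in practice depends on the size of your documents.

```pascal
program StripNulSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: removes NUL characters using a single allocation.
function StripNulChars(const RawText: string): string;
var
  I, Count: Integer;
begin
  // Reserve the maximum possible length up front
  SetLength(Result, Length(RawText));
  Count := 0;
  for I := 1 to Length(RawText) do
    if RawText[I] <> #0 then
    begin
      Inc(Count);
      Result[Count] := RawText[I];
    end;
  // Trim the result to the number of characters actually kept
  SetLength(Result, Count);
end;

var
  Sample: string;
begin
  Sample := 'Hel'#0'lo'#0;
  Writeln('Length before: ', Length(Sample));
  Writeln('Length after:  ', Length(StripNulChars(Sample)));
  Writeln('Cleaned text:  ', StripNulChars(Sample));
end.
```

The cleaned string can then be passed to UTF8Encode exactly as in CleanAndFormatText.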

Extracting Text from a Specific Region

Extract text from a rectangular region on a page:

procedure ExtractTextFromRegion;
var
  Pdf: TPdf;
  RegionText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Extract text from specific rectangle
    // Parameters: Left, Top, Right, Bottom (in PDF coordinates)
    RegionText := Pdf.TextInRectangle(100, 700, 500, 600);
    ShowMessage('Text in region: ' + RegionText);
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;
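
The sample call passes Top = 700 and Bottom = 600, which fits PDF's default coordinate system, where Y grows upward from the bottom edge of the page. If a region is measured from the top of the page instead, as is common in screen layouts, it has to be flipped first. The sketch below shows that arithmetic under the assumption of an unrotated page whose origin is at the bottom-left corner. TPdfRegion and TopDownToPdfRegion are hypothetical names, and the page height of 792 points (US Letter) is only a sample value.

```pascal
program RegionCoordinatesSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record used only in this sketch
  TPdfRegion = record
    Left, Top, Right, Bottom: Double;
  end;

// Converts a rectangle measured from the top edge of the page into
// bottom-up PDF coordinates. Assumes an unrotated page with its origin
// at the bottom-left corner.
function TopDownToPdfRegion(Left, Top, Right, Bottom,
  PageHeight: Double): TPdfRegion;
begin
  Result.Left := Left;
  Result.Right := Right;
  Result.Top := PageHeight - Top;
  Result.Bottom := PageHeight - Bottom;
end;

var
  Region: TPdfRegion;
begin
  // A box starting 92 points below the top edge and ending 192 points below it
  Region := TopDownToPdfRegion(100, 92, 500, 192, 792);
  Writeln(Format('PDF region: (%.0f, %.0f, %.0f, %.0f)',
    [Region.Left, Region.Top, Region.Right, Region.Bottom]));
end.
```

With these sample values the converted region is the same rectangle used in the call to TextInRectangle.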

Character-Level Access

Access individual characters for precise text analysis:

procedure AnalyzeCharacters;
var
  Pdf: TPdf;
  I: Integer;
  Ch: WideChar;            // renamed from Char to avoid shadowing the Char type
  Origin: TPdfPoint;
  CharRect: TPdfRectangle; // renamed from Rect to avoid shadowing the Rect function
  FontSize: Double;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Access each character
    for I := 0 to Pdf.CharacterCount - 1 do
    begin
      Ch := Pdf.Character[I];
      Origin := Pdf.CharacterOrigin[I];
      CharRect := Pdf.CharacterRectangle[I]; // bounding box, available for layout analysis
      FontSize := Pdf.FontSize[I];

      // Check character properties
      if Pdf.CharacterGenerated[I] then
        // Character was generated (e.g., hyphenation)
        Continue;

      if Pdf.CharacterMapError[I] then
        // Character couldn't be mapped to Unicode
        Continue;

      // Process character with position and size info
      Memo1.Lines.Add(Format('Char: %s at (%.2f, %.2f) size: %.2f',
        [Ch, Origin.X, Origin.Y, FontSize]));
    end;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Finding the Character at a Screen Position

This is useful for text selection and interaction:

function GetCharacterAtPosition(Pdf: TPdf; X, Y: Double): Integer;
begin
  // Get character index at position with tolerance
  Result := Pdf.CharacterIndexAtPos(X, Y, 5.0, 5.0);
end;

Handling Errors and Edge Cases

procedure TFormMain.SafeExtractText;
var
  I, StartPage, EndPage: Integer;
  PageText: string;
begin
  try
    Pdf.FileName := EditPdfFile.Text;
    Pdf.PageNumber := 0;
    Pdf.Active := True;
  except
    on E: Exception do
    begin
      LogMessage('Failed to load PDF: ' + E.Message);
      Exit;
    end;
  end;

  try
    // Assumed here: process the whole document
    StartPage := 1;
    EndPage := Pdf.PageCount;

    for I := StartPage to EndPage do
    begin
      try
        Pdf.PageNumber := I;
        PageText := Pdf.Text;
        // Process text...
      except
        on E: Exception do
          // Log error but continue with next page
          LogMessage('Error on page ' + IntToStr(I) + ': ' + E.Message);
      end;
    end;
  finally
    Pdf.Active := False;
  end;
end;

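Performance Considerations

Extract text page by page instead of loading the entire content into memory at once.
For large documents, use streamed file output.
Call Application.ProcessMessages inside the loop to keep the UI responsive.
Consider batch processing when working with multiple documents.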
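
The first two points can be combined into a small helper that encodes one page of text and writes it straight to a stream, so that only the current page is held in memory. The sketch below is one way to do this. WriteUtf8 is a hypothetical helper, and a TMemoryStream with literal strings stands in for the output file and the extracted page text so that the example runs without a PDF.

```pascal
program Utf8StreamSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: encodes S as UTF-8 and appends it to the stream.
procedure WriteUtf8(Stream: TStream; const S: string);
var
  Bytes: UTF8String;
begin
  Bytes := UTF8Encode(S);
  // Guard against empty strings, where Bytes[1] would be out of range
  if Length(Bytes) > 0 then
    Stream.WriteBuffer(Bytes[1], Length(Bytes));
end;

var
  Output: TMemoryStream;
begin
  Output := TMemoryStream.Create;
  try
    WriteUtf8(Output, 'Page 1 text');       // stands in for the text of page 1
    WriteUtf8(Output, #13#10#13#10#13#10);  // page separator
    WriteUtf8(Output, 'Page 2 text');       // stands in for the text of page 2
    WriteUtf8(Output, '');                  // empty pages are skipped safely
    Writeln('Bytes written: ', Output.Size);
  finally
    Output.Free;
  end;
end.
```

Because the helper accepts any TStream, the same call works with a TFileStream opened with fmCreate.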

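Conclusion

The Extract Text demo shows that PDFium VCL makes text extraction concise and reliable to implement. It provides the tools you need for everything from basic text extraction to advanced processing that takes paragraph structure into account.

Character-level access enables complex text analysis, while the Text property alone covers many common cases in simpler applications.

Start building your text extraction solution with the PDFium VCL component today.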