Extracting Text from PDF Documents with PDFium VCL in Delphi

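Text extraction is one of the most common PDF processing tasks. Whether you are building a document search engine, a data mining application, or a content management system, the ability to extract text from PDF files is essential. This tutorial walks through the Extract Text demo, which shows how to extract text content from PDF documents using PDFium VCL.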

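Overview

The Extract Text demo shows how to extract all text content from a PDF document and save it to a text file. It supports page range selection and paragraph preservation, and it handles special characters correctly.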

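Key Features

Full document extraction – extract text from all pages in one pass
Page range selection – extract text from specific pages only
Paragraph detection – preserve paragraph structure based on character positions
Special character handling – option to remove NUL characters from the output
Page separators – optionally add blank lines between pages
Progress tracking – visual progress bar and detailed logging
UTF-8 output – correctly encoded text output suitable for international documents
Character-level access – access individual characters for advanced processing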

PDFium DLL Requirements

Before running a PDFium VCL application, make sure the PDFium DLL files are installed.

pdfium32.dll / pdfium64.dll – standard version (about 5-6 MB)
pdfium32v8.dll / pdfium64v8.dll – includes the V8 JavaScript engine (about 23-27 MB)

Installation: run PDFiumVCL\DLLs\CopyDlls.bat as administrator to copy the DLL files to the Windows system directory automatically.
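
Before activating the component, it can help to confirm that the DLL is actually loadable. The following is a minimal sketch and not part of PDFium VCL: it assumes the DLL names listed above and uses only the Windows API calls LoadLibrary and FreeLibrary. The helper name PdfiumDllAvailable is hypothetical.

```pascal
program CheckPdfiumDll;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows;

// Hypothetical helper: returns True if the named DLL can be loaded
// through the normal Windows DLL search path.
function PdfiumDllAvailable(const DllName: string): Boolean;
var
  Handle: HMODULE;
begin
  Handle := LoadLibrary(PChar(DllName));
  Result := Handle <> 0;
  if Result then
    FreeLibrary(Handle);
end;

const
  // Pick the DLL that matches the bitness of this executable.
  {$IFDEF WIN64}
  DllName = 'pdfium64.dll';
  {$ELSE}
  DllName = 'pdfium32.dll';
  {$ENDIF}

begin
  if PdfiumDllAvailable(DllName) then
    Writeln(DllName, ' was found and loaded.')
  else
    Writeln(DllName, ' could not be loaded.');
end.
```

A 32-bit executable can only load the 32-bit DLL and a 64-bit executable only the 64-bit one, which is why the sketch selects the name at compile time.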

Basic Text Extraction

The simplest way to extract text from a PDF page:

procedure ExtractSimpleText;
var
  Pdf: TPdf;
  PageText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;

    // Extract text from page 1
    Pdf.PageNumber := 1;
    PageText := Pdf.Text;

    // Use the extracted text (Memo1 is assumed to be a TMemo in scope)
    Memo1.Lines.Text := PageText;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Extracting from All Pages

Iterate over all pages to extract the text of the whole document:

procedure TFormMain.ButtonExtractClick(Sender: TObject);
var
  I, StartPage, EndPage: Integer;
  PageText: string;
  FileStream: TFileStream;
  Utf8Text: UTF8String;
begin
  Pdf.FileName := EditPdfFile.Text;
  Pdf.PageNumber := 0;
  Pdf.Active := True;
  try
    // Determine page range
    if RadioButtonAllPages.Checked then
    begin
      StartPage := 1;
      EndPage := Pdf.PageCount;
    end
    else
    begin
      StartPage := StrToInt(EditFromPage.Text);
      EndPage := StrToInt(EditToPage.Text);
    end;

    // One progress step per page
    ProgressBar.Max := EndPage - StartPage + 1;

    // Create output file
    FileStream := TFileStream.Create(EditOutputFile.Text, fmCreate);
    try
      for I := StartPage to EndPage do
      begin
        Pdf.PageNumber := I;
        PageText := Pdf.Text;

        // Convert to UTF-8 and write
        Utf8Text := UTF8Encode(PageText);
        if Length(Utf8Text) > 0 then
          FileStream.WriteBuffer(Utf8Text[1], Length(Utf8Text));

        // Add page separator if enabled (not after the last page)
        if CheckBoxPageSeparator.Checked and (I < EndPage) then
        begin
          Utf8Text := UTF8Encode(#13#10#13#10#13#10);
          FileStream.WriteBuffer(Utf8Text[1], Length(Utf8Text));
        end;

        ProgressBar.Position := I - StartPage + 1;
        Application.ProcessMessages;
      end;
    finally
      FileStream.Free;
    end;
  finally
    Pdf.Active := False;
  end;
end;
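
The handler above converts the page fields with StrToInt, which raises an exception on non-numeric input and does not check the range against the page count. As a rough sketch of one way to validate the input first, the hypothetical helper TryParsePageRange below uses the RTL function TryStrToInt; it is an illustration, not part of the demo.

```pascal
program PageRangeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: parses two 1-based page numbers and checks that
// 1 <= StartPage <= EndPage <= PageCount. Returns False on bad input
// instead of raising an exception as StrToInt would.
function TryParsePageRange(const FromText, ToText: string; PageCount: Integer;
  out StartPage, EndPage: Integer): Boolean;
begin
  Result := TryStrToInt(Trim(FromText), StartPage) and
            TryStrToInt(Trim(ToText), EndPage) and
            (StartPage >= 1) and (StartPage <= EndPage) and
            (EndPage <= PageCount);
end;

var
  FirstPage, LastPage: Integer;
begin
  if TryParsePageRange('2', '5', 10, FirstPage, LastPage) then
    Writeln('Extract pages ', FirstPage, ' to ', LastPage)
  else
    Writeln('Invalid page range');

  if not TryParsePageRange('7', 'abc', 10, FirstPage, LastPage) then
    Writeln('Second range rejected');
end.
```

In a form, the same check could run before the output file is created, so that a bad range never leaves an empty file behind.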

Extracting Text While Preserving Paragraph Structure

For documents where paragraph structure matters, use character position analysis.

function ExtractTextWithParagraphs(Pdf: TPdf): string;
var
  CharIndex: Integer;
  CurrentChar: WideChar;
  CurrentY, PrevY: Double;
  LineHeight, YGap: Double;
  ResultText, LineBuffer: string;
  MinLineHeight: Double;
begin
  ResultText := '';
  LineBuffer := '';
  PrevY := -1;
  MinLineHeight := 999999;

  // First pass: estimate line height as the smallest non-zero Y gap
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if (YGap > 0) and (YGap < MinLineHeight) then
        MinLineHeight := YGap;
    end;
    PrevY := CurrentY;
  end;

  LineHeight := MinLineHeight;
  // Fall back to a default when no usable gap was found
  if (LineHeight <= 0) or (LineHeight >= 999999) then
    LineHeight := 12;

  // Second pass: build text with paragraph detection
  PrevY := -1;
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentChar := Pdf.Character[CharIndex];
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;

    // Skip NUL characters
    if Ord(CurrentChar) = 0 then
      Continue;

    // Check for line break based on Y position change
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if YGap > LineHeight * 1.2 then
      begin
        // Add current line to result
        if LineBuffer <> '' then
        begin
          ResultText := ResultText + LineBuffer + #13#10;
          LineBuffer := '';
        end;

        // Check if this is a paragraph break (larger gap)
        if YGap > LineHeight * 2.5 then
          ResultText := ResultText + #13#10; // Extra line for paragraph
      end;
    end;

    LineBuffer := LineBuffer + CurrentChar;
    PrevY := CurrentY;
  end;

  // Add final line
  if LineBuffer <> '' then
    ResultText := ResultText + LineBuffer;

  Result := ResultText;
end;
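
The decision at the heart of the function above is a comparison of each vertical gap against two multiples of the estimated line height. The sketch below isolates that comparison so the thresholds can be tried out without a PDF; TGapKind and ClassifyGap are illustrative names, and the 1.2 and 2.5 multipliers are simply the ones used above, which may need tuning for a given document.

```pascal
program GapClassifierDemo;

{$APPTYPE CONSOLE}

type
  TGapKind = (gkSameLine, gkLineBreak, gkParagraphBreak);

// Classifies the vertical distance between two consecutive characters,
// using the same 1.2 and 2.5 multipliers as the extraction function.
function ClassifyGap(YGap, LineHeight: Double): TGapKind;
begin
  if YGap > LineHeight * 2.5 then
    Result := gkParagraphBreak
  else if YGap > LineHeight * 1.2 then
    Result := gkLineBreak
  else
    Result := gkSameLine;
end;

const
  KindNames: array[TGapKind] of string =
    ('same line', 'line break', 'paragraph break');

begin
  // With a 12-point line height the thresholds are 14.4 and 30 points.
  Writeln('gap  0.0 -> ', KindNames[ClassifyGap(0.0, 12.0)]);
  Writeln('gap 15.0 -> ', KindNames[ClassifyGap(15.0, 12.0)]);
  Writeln('gap 36.0 -> ', KindNames[ClassifyGap(36.0, 12.0)]);
end.
```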

Cleaning Up Extracted Text

Remove NUL characters and return the text encoded as UTF-8.

function CleanAndFormatText(const RawText: string): UTF8String;
var
  I: Integer;
  CleanText: string;
begin
  CleanText := '';
  for I := 1 to Length(RawText) do
  begin
    // Skip NUL characters but keep all other characters
    if Ord(RawText[I]) <> 0 then
      CleanText := CleanText + RawText[I];
  end;
  Result := UTF8Encode(CleanText);
end;
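
Appending one character at a time can trigger repeated string reallocations on long pages. As a sketch of an alternative (not taken from the demo, and not benchmarked here), the hypothetical StripNulChars below allocates the result once at the input length and trims it at the end.

```pascal
program StripNulDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: returns RawText without NUL (#0) characters.
// The result is preallocated to the input length and shrunk afterwards.
function StripNulChars(const RawText: string): string;
var
  I, Count: Integer;
begin
  SetLength(Result, Length(RawText));
  Count := 0;
  for I := 1 to Length(RawText) do
    if RawText[I] <> #0 then
    begin
      Inc(Count);
      Result[Count] := RawText[I];
    end;
  SetLength(Result, Count);
end;

var
  Cleaned: string;
begin
  Cleaned := StripNulChars('PDF'#0#0' text'#0);
  Writeln('"', Cleaned, '" has ', Length(Cleaned), ' characters');
end.
```

The cleaned string can then be passed to UTF8Encode in the same way as in CleanAndFormatText.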

Extracting Text from a Specific Region

Extract text from a rectangular area of the page.

procedure ExtractTextFromRegion;
var
  Pdf: TPdf;
  RegionText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Extract text from specific rectangle
    // Parameters: Left, Top, Right, Bottom (in PDF coordinates)
    RegionText := Pdf.TextInRectangle(100, 700, 500, 600);
    ShowMessage('Text in region: ' + RegionText);
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;
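
In the example above Top (700) is larger than Bottom (600), which matches the usual PDF convention of an origin at the bottom-left corner with Y growing upward. Regions are often measured from the top-left corner instead, so a small conversion can be handy. The sketch below assumes that convention and a known page height in points (792 for US Letter); TRegion and TopLeftToPdfRegion are illustrative names, not PDFium VCL types.

```pascal
program RegionCoordsDemo;

{$APPTYPE CONSOLE}

type
  // Illustrative record, not a PDFium VCL type.
  TRegion = record
    Left, Top, Right, Bottom: Double;
  end;

// Converts a rectangle measured from the top-left corner of the page
// (Y grows downward) into bottom-left-origin coordinates (Y grows upward).
function TopLeftToPdfRegion(X, Y, Width, Height, PageHeight: Double): TRegion;
begin
  Result.Left := X;
  Result.Right := X + Width;
  Result.Top := PageHeight - Y;
  Result.Bottom := PageHeight - (Y + Height);
end;

var
  R: TRegion;
begin
  // A 400 x 100 box placed 92 points below the top of a 792-point page.
  R := TopLeftToPdfRegion(100, 92, 400, 100, 792);
  Writeln(R.Left:0:1, ' ', R.Top:0:1, ' ', R.Right:0:1, ' ', R.Bottom:0:1);
  // prints "100.0 700.0 500.0 600.0"
end.
```

The four resulting values are the same ones passed to TextInRectangle in the example above.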

Character-Level Access

Access individual characters for precise text analysis.

procedure AnalyzeCharacters;
var
  Pdf: TPdf;
  I: Integer;
  Ch: WideChar; // renamed from Char to avoid shadowing the built-in type
  Origin: TPdfPoint;
  CharRect: TPdfRectangle;
  FontSize: Double;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Access each character
    for I := 0 to Pdf.CharacterCount - 1 do
    begin
      Ch := Pdf.Character[I];
      Origin := Pdf.CharacterOrigin[I];
      CharRect := Pdf.CharacterRectangle[I]; // bounding box, available if needed
      FontSize := Pdf.FontSize[I];

      // Check character properties
      if Pdf.CharacterGenerated[I] then
        // Character was generated (e.g., hyphenation)
        Continue;

      if Pdf.CharacterMapError[I] then
        // Character couldn't be mapped to Unicode
        Continue;

      // Process character with position and size info
      Memo1.Lines.Add(Format('Char: %s at (%.2f, %.2f) size: %.2f',
        [Ch, Origin.X, Origin.Y, FontSize]));
    end;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Finding the Character at a Screen Position

Useful for text selection and interaction.

function GetCharacterAtPosition(Pdf: TPdf; X, Y: Double): Integer;
begin
  // Get character index at position with tolerance
  Result := Pdf.CharacterIndexAtPos(X, Y, 5.0, 5.0);
end;

Error Handling and Edge Cases

procedure TFormMain.SafeExtractText;
var
  I, StartPage, EndPage: Integer;
  PageText: string;
begin
  try
    Pdf.FileName := EditPdfFile.Text;
    Pdf.PageNumber := 0;
    Pdf.Active := True;
  except
    on E: Exception do
    begin
      LogMessage('Failed to load PDF: ' + E.Message);
      Exit;
    end;
  end;

  try
    // Assumption: process the whole document; adjust the range as needed
    StartPage := 1;
    EndPage := Pdf.PageCount;

    for I := StartPage to EndPage do
    begin
      try
        Pdf.PageNumber := I;
        PageText := Pdf.Text;
        // Process text...
      except
        on E: Exception do
        begin
          // Log error but continue with next page
          LogMessage('Error on page ' + IntToStr(I) + ': ' + E.Message);
        end;
      end;
    end;
  finally
    Pdf.Active := False;
  end;
end;

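Performance Considerations

Extract text page by page so that the whole document is never held in memory at once.
Use streaming file output for large documents.
Call Application.ProcessMessages inside the loop to keep the UI responsive.
Consider batch processing when handling multiple documents.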
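
For the batch case, one possible shape is a loop over every PDF in a folder that hands each file to an extraction routine and keeps going when a single file fails. The sketch below is an assumption about how such a loop might look, not code from the demo: TExtractProc, ExtractFolder, and DummyExtract are hypothetical names, and the placeholder callback only prints the file names where a real one would run the page-by-page extraction shown earlier.

```pascal
program BatchExtractSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Types, System.IOUtils;

type
  // Hypothetical callback: extracts one PDF into one text file.
  TExtractProc = procedure(const PdfFile, TextFile: string);

// Calls Extract for every *.pdf file in Folder and returns how many
// files were processed without raising an exception.
function ExtractFolder(const Folder: string; Extract: TExtractProc): Integer;
var
  PdfFile: string;
begin
  Result := 0;
  for PdfFile in TDirectory.GetFiles(Folder, '*.pdf') do
    try
      Extract(PdfFile, ChangeFileExt(PdfFile, '.txt'));
      Inc(Result);
    except
      on E: Exception do
        Writeln('Skipped ', PdfFile, ': ', E.Message);
    end;
end;

// Placeholder that only reports what would be done.
procedure DummyExtract(const PdfFile, TextFile: string);
begin
  Writeln(PdfFile, ' -> ', TextFile);
end;

begin
  Writeln(ExtractFolder(GetCurrentDir, DummyExtract), ' file(s) processed');
end.
```

Writing each output next to its source with ChangeFileExt is only one choice; a separate output folder would work equally well.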

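Conclusion

The Extract Text demo shows how to perform text extraction simply and reliably with PDFium VCL. Whether you need basic text extraction or more advanced paragraph-aware processing, the component provides the tools for both.

Character-level access enables detailed text analysis, while the simple Text property handles most common use cases in a single line of code.

Start building your text extraction solution with the PDFium VCL component today.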