Extracting Text from PDF Documents with PDFium VCL in Delphi

Text extraction is one of the most common tasks in PDF processing. Whether you are building a document search engine, a data mining application, or a content management system, the ability to extract text from PDF files is essential. This tutorial covers the Text Extraction demo, which shows how to extract textual content from PDF documents using PDFium VCL.

Overview

The Text Extraction demo shows how to extract all textual content from a PDF document and save it to a text file. It supports page range selection and paragraph preservation, and it handles special characters correctly (optional NUL removal, UTF-8 output).

Key Features

Full document extraction – Extract text from all pages at once
Page range selection – Extract text only from specific pages
Paragraph detection – Preserve paragraph structure based on character positions
Special character handling – Option to remove NUL characters from the output
Page separators – Optional blank lines between pages
Progress tracking – Visual progress bar and detailed logging
UTF-8 output – Output text encoded correctly for international documents
Character-level access – Access individual characters for advanced processing

PDFium DLL Requirements

Before running any PDFium VCL application, make sure the PDFium DLL files are installed:

pdfium32.dll / pdfium64.dll – Standard builds (~5-6 MB)
pdfium32v8.dll / pdfium64v8.dll – Builds with the V8 JavaScript engine (~23-27 MB)

Installation: Run PDFiumVCL\DLLs\CopyDlls.bat as administrator to copy the DLL files into the Windows system directories automatically.
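
A process can only load a DLL of its own bitness, so a startup check for the matching file can save some debugging. The following is a minimal sketch, not part of PDFium VCL: ExpectedPdfiumDll is a hypothetical helper, and the check assumes the DLL was placed next to the executable rather than installed by CopyDlls.bat. The component presumably selects the DLL on its own; this is for diagnostics only.

```pascal
program PdfiumDllCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the DLL file name from the list above that
// matches the bitness of the running process. UseV8 selects the V8 build.
function ExpectedPdfiumDll(UseV8: Boolean): string;
begin
  if SizeOf(Pointer) = 8 then
    Result := 'pdfium64'
  else
    Result := 'pdfium32';
  if UseV8 then
    Result := Result + 'v8';
  Result := Result + '.dll';
end;

var
  DllPath: string;
begin
  // Assumption: the DLL sits next to the executable. A copy installed into
  // the Windows system directories is not detected by this check.
  DllPath := ExtractFilePath(ParamStr(0)) + ExpectedPdfiumDll(False);
  if FileExists(DllPath) then
    WriteLn('Found ', DllPath)
  else
    WriteLn('Not found next to the executable: ', DllPath);
end.
```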

Basic Text Extraction

The simplest way to extract text from a PDF page:

procedure ExtractSimpleText;
var
  Pdf: TPdf;
  PageText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;

    // Extract text from page 1
    Pdf.PageNumber := 1;
    PageText := Pdf.Text;

    // Use the extracted text (Memo1 is a TMemo on the calling form)
    Memo1.Lines.Text := PageText;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Extracting from All Pages

Loop over all pages to extract the full text of the document:

procedure TFormMain.ButtonExtractClick(Sender: TObject);
var
  I, StartPage, EndPage: Integer;
  PageText: string;
  FileStream: TFileStream;
  Text: UTF8String;
begin
  Pdf.FileName := EditPdfFile.Text;
  Pdf.PageNumber := 0;
  Pdf.Active := True;
  try
    // Determine page range
    if RadioButtonAllPages.Checked then
    begin
      StartPage := 1;
      EndPage := Pdf.PageCount;
    end
    else
    begin
      StartPage := StrToInt(EditFromPage.Text);
      EndPage := StrToInt(EditToPage.Text);
    end;

    // Size the progress bar to the number of pages being extracted
    ProgressBar.Max := EndPage - StartPage + 1;

    // Create output file
    FileStream := TFileStream.Create(EditOutputFile.Text, fmCreate);
    try
      for I := StartPage to EndPage do
      begin
        Pdf.PageNumber := I;
        PageText := Pdf.Text;

        // Convert to UTF-8 and write
        Text := UTF8Encode(PageText);
        if Length(Text) > 0 then
          FileStream.WriteBuffer(Text[1], Length(Text));

        // Add page separator if enabled
        if CheckBoxPageSeparator.Checked and (I < EndPage) then
        begin
          Text := UTF8Encode(#13#10#13#10#13#10);
          FileStream.WriteBuffer(Text[1], Length(Text));
        end;

        ProgressBar.Position := I - StartPage + 1;
        Application.ProcessMessages;
      end;
    finally
      FileStream.Free;
    end;
  finally
    Pdf.Active := False;
  end;
end;
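
The handler above passes the edit box contents straight to StrToInt, which raises an exception on non-numeric input and does not check the range against the page count. A possible guard is sketched below. TryParsePageRange is a hypothetical helper, not part of PDFium VCL, and it assumes pages are numbered from 1 as in the loop above.

```pascal
program PageRangeCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: parses both page fields and accepts the range only
// when 1 <= StartPage <= EndPage <= PageCount.
function TryParsePageRange(const FromText, ToText: string; PageCount: Integer;
  out StartPage, EndPage: Integer): Boolean;
begin
  StartPage := 0;
  EndPage := 0;
  Result := TryStrToInt(Trim(FromText), StartPage)
    and TryStrToInt(Trim(ToText), EndPage)
    and (StartPage >= 1)
    and (StartPage <= EndPage)
    and (EndPage <= PageCount);
end;

var
  S, E: Integer;
begin
  // Example: a 10-page document and user input "2" and "5"
  if TryParsePageRange('2', '5', 10, S, E) then
    WriteLn('Extract pages ', S, ' to ', E)
  else
    WriteLn('Invalid page range');
end.
```

In the click handler, the two StrToInt calls could be replaced by one call to this function, with a message to the user when it returns False.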

Extracting Text While Preserving Paragraph Structure

For documents where paragraph structure matters, use character position analysis:

function ExtractTextWithParagraphs(Pdf: TPdf): string;
var
  CharIndex: Integer;
  CurrentChar: WideChar;
  CurrentY, PrevY: Double;
  LineHeight, YGap: Double;
  ResultText, LineBuffer: string;
  MinLineHeight: Double;
begin
  ResultText := '';
  LineBuffer := '';
  PrevY := -1;
  MinLineHeight := 999999;

  // First pass: estimate line height as the smallest non-zero change in Y
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if (YGap > 0) and (YGap < MinLineHeight) then
        MinLineHeight := YGap;
    end;
    PrevY := CurrentY;
  end;

  LineHeight := MinLineHeight;
  // Fall back to a default when no line change was found (single-line page)
  if (LineHeight <= 0) or (LineHeight >= 999999) then
    LineHeight := 12;

  // Second pass: build text with paragraph detection
  PrevY := -1;
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentChar := Pdf.Character[CharIndex];
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;

    // Skip NUL characters
    if Ord(CurrentChar) = 0 then
      Continue;

    // Check for line break based on Y position change
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if YGap > LineHeight * 1.2 then
      begin
        // Add current line to result
        if LineBuffer <> '' then
        begin
          ResultText := ResultText + LineBuffer + #13#10;
          LineBuffer := '';
        end;

        // Check if this is a paragraph break (larger gap)
        if YGap > LineHeight * 2.5 then
          ResultText := ResultText + #13#10; // Extra line for paragraph
      end;
    end;

    LineBuffer := LineBuffer + CurrentChar;
    PrevY := CurrentY;
  end;

  // Add final line
  if LineBuffer <> '' then
    ResultText := ResultText + LineBuffer;

  Result := ResultText;
end;
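
The two thresholds in the second pass (1.2 and 2.5 times the line height) are easier to reason about when isolated. The sketch below pulls that decision into a small function and runs it on a few sample gaps. TBreakKind and ClassifyGap are hypothetical names introduced for illustration, and the 12-point line height is only an example value.

```pascal
program GapClassifier;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TBreakKind = (bkNone, bkLine, bkParagraph);

const
  BreakNames: array[TBreakKind] of string =
    ('same line', 'line break', 'paragraph break');

// Applies the same thresholds as the function above: a gap above
// 2.5 line heights is a paragraph break, above 1.2 a line break.
function ClassifyGap(YGap, LineHeight: Double): TBreakKind;
begin
  if YGap > LineHeight * 2.5 then
    Result := bkParagraph
  else if YGap > LineHeight * 1.2 then
    Result := bkLine
  else
    Result := bkNone;
end;

const
  SampleGaps: array[0..3] of Double = (0, 14, 15, 31);
var
  I: Integer;
begin
  // With a 12-point line height the cut-offs are 14.4 and 30 points
  for I := Low(SampleGaps) to High(SampleGaps) do
    WriteLn(SampleGaps[I]:0:1, ' -> ',
      BreakNames[ClassifyGap(SampleGaps[I], 12)]);
end.
```

Keeping the classification separate makes it possible to tune the factors for a particular document set without touching the extraction loop.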

Cleaning Extracted Text

Remove NUL characters and encode the result as UTF-8:

function CleanAndFormatText(const RawText: string): UTF8String;
var
  I: Integer;
  CleanText: string;
begin
  CleanText := '';
  for I := 1 to Length(RawText) do
  begin
    // Skip NUL characters but keep all other characters
    if Ord(RawText[I]) <> 0 then
      CleanText := CleanText + RawText[I];
  end;
  Result := UTF8Encode(CleanText);
end;
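
Appending one character at a time can reallocate the string repeatedly on large pages. One alternative, sketched here under the assumption of 1-based string indexing (the desktop Delphi default), allocates the result once and trims it at the end. StripNulChars is a hypothetical name. Unlike the function above it returns a plain string, so the UTF8Encode step would still follow.

```pascal
program StripNulDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper: copies every non-NUL character into a buffer
// that is allocated once and shortened to the final length.
function StripNulChars(const RawText: string): string;
var
  I, OutLen: Integer;
begin
  SetLength(Result, Length(RawText));
  OutLen := 0;
  for I := 1 to Length(RawText) do
    if RawText[I] <> #0 then
    begin
      Inc(OutLen);
      Result[OutLen] := RawText[I];
    end;
  SetLength(Result, OutLen);
end;

begin
  // The literal contains two embedded NUL characters
  WriteLn(StripNulChars('ab'#0'cd'#0));
end.
```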

Extracting Text from a Specific Region

Extract text from a rectangular region of the page:

procedure ExtractTextFromRegion;
var
  Pdf: TPdf;
  RegionText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Extract text from specific rectangle
    // Parameters: Left, Top, Right, Bottom (in PDF coordinates)
    RegionText := Pdf.TextInRectangle(100, 700, 500, 600);
    ShowMessage('Text in region: ' + RegionText);
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;
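
In the call above, Top (700) is larger than Bottom (600) because PDF page space has its origin at the bottom-left corner by default. When a region comes from two arbitrary corner points, for example a mouse drag, the values may arrive in any order. The sketch below sorts them into that Left, Top, Right, Bottom layout. TRegion and NormalizeRegion are hypothetical, and whether TextInRectangle actually requires ordered values is not stated in this tutorial.

```pascal
program RegionNormalize;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Math;

type
  // Hypothetical helper record; not a PDFium VCL type
  TRegion = record
    Left, Top, Right, Bottom: Double;
  end;

// Orders two corner points so that Left <= Right and Top >= Bottom,
// matching PDF coordinates where Y grows upwards.
function NormalizeRegion(X1, Y1, X2, Y2: Double): TRegion;
begin
  Result.Left := Min(X1, X2);
  Result.Right := Max(X1, X2);
  Result.Top := Max(Y1, Y2);
  Result.Bottom := Min(Y1, Y2);
end;

var
  R: TRegion;
begin
  // Corners given bottom-right first, then top-left
  R := NormalizeRegion(500, 600, 100, 700);
  WriteLn(Format('%.0f, %.0f, %.0f, %.0f',
    [R.Left, R.Top, R.Right, R.Bottom]));
end.
```

The normalized fields can then be passed in order as the four arguments of the region call.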

Character-Level Access

For precise text analysis, access individual characters:

procedure AnalyzeCharacters;
var
  Pdf: TPdf;
  I: Integer;
  Ch: WideChar;          // renamed from Char to avoid shadowing the built-in type
  Origin: TPdfPoint;
  CharRect: TPdfRectangle; // renamed from Rect to avoid shadowing the Rect function
  FontSize: Double;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Access each character
    for I := 0 to Pdf.CharacterCount - 1 do
    begin
      Ch := Pdf.Character[I];
      Origin := Pdf.CharacterOrigin[I];
      CharRect := Pdf.CharacterRectangle[I];
      FontSize := Pdf.FontSize[I];

      // Check character properties
      if Pdf.CharacterGenerated[I] then
        // Character was generated (e.g., hyphenation)
        Continue;

      if Pdf.CharacterMapError[I] then
        // Character couldn't be mapped to Unicode
        Continue;

      // Process character with position and size info
      Memo1.Lines.Add(Format('Char: %s at (%.2f, %.2f) size: %.2f',
        [Ch, Origin.X, Origin.Y, FontSize]));
    end;
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Finding the Character at a Screen Position

Useful for text selection and interaction:

function GetCharacterAtPosition(Pdf: TPdf; X, Y: Double): Integer;
begin
  // Get character index at position with tolerance
  Result := Pdf.CharacterIndexAtPos(X, Y, 5.0, 5.0);
end;

Error Handling and Edge Cases

procedure TFormMain.SafeExtractText;
var
  I, StartPage, EndPage: Integer;
  PageText: string;
begin
  try
    Pdf.FileName := EditPdfFile.Text;
    Pdf.PageNumber := 0;
    Pdf.Active := True;
  except
    on E: Exception do
    begin
      LogMessage('Failed to load PDF: ' + E.Message);
      Exit;
    end;
  end;

  try
    // Page range: the whole document in this example
    StartPage := 1;
    EndPage := Pdf.PageCount;

    for I := StartPage to EndPage do
    begin
      try
        Pdf.PageNumber := I;
        PageText := Pdf.Text;
        // Process text...
      except
        on E: Exception do
        begin
          // Log error but continue with next page
          LogMessage('Error on page ' + IntToStr(I) + ': ' + E.Message);
        end;
      end;
    end;
  finally
    Pdf.Active := False;
  end;
end;

Performance Considerations

Extract text page by page instead of loading everything into memory.
Use streaming file output for large documents.
Call Application.ProcessMessages inside loops to keep the user interface responsive.
Consider batch processing for multiple documents.
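
For the batch processing point, a first step is to enumerate the input files and derive an output name for each one. The following is a minimal sketch using only the RTL. OutputNameFor is a hypothetical helper, the folder is taken from the first command-line argument, and the actual extraction call is left as a comment because it depends on how the application owns its TPdf instance.

```pascal
program BatchNames;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: maps an input PDF path to a text file path
// with the same name in the same folder.
function OutputNameFor(const PdfPath: string): string;
begin
  Result := ChangeFileExt(PdfPath, '.txt');
end;

var
  Folder: string;
  SR: TSearchRec;
begin
  if ParamCount < 1 then
  begin
    WriteLn('Usage: BatchNames <folder>');
    Exit;
  end;

  Folder := IncludeTrailingPathDelimiter(ParamStr(1));
  if FindFirst(Folder + '*.pdf', faAnyFile, SR) = 0 then
  try
    repeat
      if (SR.Attr and faDirectory) = 0 then
      begin
        // A real batch job would open the PDF here and write its text
        // to the output file, one document at a time.
        WriteLn(Folder + SR.Name, ' -> ', OutputNameFor(Folder + SR.Name));
      end;
    until FindNext(SR) <> 0;
  finally
    FindClose(SR);
  end;
end.
```

Processing one document at a time, and closing it before opening the next, keeps memory use independent of the number of files in the folder.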

Conclusion

The Text Extraction demo shows how PDFium VCL makes text extraction simple and reliable. Whether you need plain text extraction or more advanced paragraph-aware processing, the component provides the necessary tools.

Character-level access enables sophisticated text analysis, while the simple Text property covers most common use cases in a single line of code.

Start building your text extraction solution with the PDFium VCL component today.