Extracting Text from PDF Documents with PDFium VCL in Delphi

Text extraction is one of the most common PDF processing tasks. Whether you’re building a document search engine, data mining application, or content management system, the ability to extract text from PDF files is essential. This tutorial covers the Extract Text demo, which shows how to extract text content from PDF documents using PDFium VCL.

Overview

The Extract Text demo shows how to extract all text content from a PDF document and save it to a text file. It supports page range selection and paragraph preservation, and it can strip NUL characters from the output.

Key Features

Full Document Extraction – Extract text from all pages at once
Page Range Selection – Extract text from specific pages only
Paragraph Detection – Preserve paragraph structure based on character positions
Special Character Handling – Option to remove NUL characters from output
Page Separators – Optional blank lines between pages
Progress Tracking – Visual progress bar and detailed logging
UTF-8 Output – Properly encoded text output for international documents
Character-Level Access – Access individual characters for advanced processing

PDFium DLL Requirements

Before running any PDFium VCL application, ensure the PDFium DLL files are installed:

pdfium32.dll / pdfium64.dll – Standard versions (~5-6 MB)
pdfium32v8.dll / pdfium64v8.dll – With V8 JavaScript engine (~23-27 MB)

Installation: Run PDFiumVCL\DLLs\CopyDlls.bat as Administrator to automatically copy the DLLs to Windows system directories.

Basic Text Extraction

The simplest way to extract text from a PDF page:

procedure ExtractSimpleText;
var
  Pdf: TPdf;
  PageText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    
    // Extract text from page 1
    Pdf.PageNumber := 1;
    PageText := Pdf.Text;
    
    // Use the extracted text
    Memo1.Lines.Text := PageText;
    
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Extracting from All Pages

Loop through all pages to extract the complete document text:

procedure TFormMain.ButtonExtractClick(Sender: TObject);
var
  I, StartPage, EndPage: Integer;
  PageText: string;
  FileStream: TFileStream;
  Text: UTF8String;
begin
  Pdf.FileName := EditPdfFile.Text;
  Pdf.PageNumber := 0;
  Pdf.Active := True;
  
  try
    // Determine page range
    if RadioButtonAllPages.Checked then
    begin
      StartPage := 1;
      EndPage := Pdf.PageCount;
    end
    else
    begin
      StartPage := StrToInt(EditFromPage.Text);
      EndPage := StrToInt(EditToPage.Text);
    end;
    
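    // Size the progress bar to the number of pages being extracted
    ProgressBar.Max := EndPage - StartPage + 1;
    ProgressBar.Position := 0;

    // Create output file
    FileStream := TFileStream.Create(EditOutputFile.Text, fmCreate);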
    try
      for I := StartPage to EndPage do
      begin
        Pdf.PageNumber := I;
        PageText := Pdf.Text;
        
        // Convert to UTF-8 and write
        Text := UTF8Encode(PageText);
        if Length(Text) > 0 then
          FileStream.WriteBuffer(Text[1], Length(Text));

        // Add page separator if enabled
        if CheckBoxPageSeparator.Checked and (I < EndPage) then
        begin
          Text := UTF8Encode(#13#10#13#10#13#10);
          FileStream.WriteBuffer(Text[1], Length(Text));
        end;
        
        ProgressBar.Position := I - StartPage + 1;
        Application.ProcessMessages;
      end;
    finally
      FileStream.Free;
    end;
    
  finally
    Pdf.Active := False;
  end;
end;
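
The handler above passes the edit-box contents straight to StrToInt, which raises an exception on non-numeric input and does not check the values against the document. One way to guard against that is sketched below. TryParsePageRange is a hypothetical helper written for this illustration, not part of PDFium VCL; it relies only on TryStrToInt from the Delphi RTL.

```pascal
program PageRangeCheck;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper (not part of PDFium VCL): parses the two edit-box
// strings and checks them against the document's page count.
// Returns False if either value is not a number or the range is invalid.
function TryParsePageRange(const FromText, ToText: string; PageCount: Integer;
  out StartPage, EndPage: Integer): Boolean;
begin
  Result := TryStrToInt(Trim(FromText), StartPage)
    and TryStrToInt(Trim(ToText), EndPage)
    and (StartPage >= 1)
    and (EndPage <= PageCount)
    and (StartPage <= EndPage);
end;

var
  S, E: Integer;
begin
  // Example values; in the demo these would come from EditFromPage.Text,
  // EditToPage.Text and Pdf.PageCount.
  if TryParsePageRange('2', '5', 10, S, E) then
    Writeln('Extracting pages ', S, ' to ', E)
  else
    Writeln('Invalid page range');
end.
```

In the click handler, the same check could run before the output file is created, so that an invalid range never leaves an empty file behind.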

Extracting Text with Paragraph Preservation

For documents where paragraph structure is important, use character position analysis:

function ExtractTextWithParagraphs(Pdf: TPdf): string;
const
  NoLineHeight = 999999; // Sentinel: no baseline change seen yet
var
  CharIndex: Integer;
  CurrentChar: WideChar;
  CurrentY, PrevY: Double;
  LineHeight, YGap: Double;
  ResultText, LineBuffer: string;
  MinLineHeight: Double;
begin
  // Works on the page currently selected through Pdf.PageNumber
  ResultText := '';
  LineBuffer := '';
  PrevY := -1;
  MinLineHeight := NoLineHeight;

  // First pass: take the smallest non-zero baseline change as the line height
  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);
      if (YGap > 0) and (YGap < MinLineHeight) then
        MinLineHeight := YGap;
    end;
    PrevY := CurrentY;
  end;

  LineHeight := MinLineHeight;
  if LineHeight >= NoLineHeight then
    LineHeight := 12; // Default fallback: the page has at most one line

  // Second pass: build text with paragraph detection
  PrevY := -1;

  for CharIndex := 0 to Pdf.CharacterCount - 1 do
  begin
    CurrentChar := Pdf.Character[CharIndex];
    CurrentY := Pdf.CharacterOrigin[CharIndex].Y;

    // Skip NUL characters
    if Ord(CurrentChar) = 0 then
      Continue;

    // Check for line break based on Y position change
    if PrevY >= 0 then
    begin
      YGap := Abs(CurrentY - PrevY);

      // LineHeight is the smallest line-to-line gap, so the threshold must
      // lie below it; otherwise evenly spaced lines would never be split.
      // Both factors are heuristics and may need tuning per document.
      if YGap > LineHeight * 0.5 then
      begin
        // Add current line to result
        if LineBuffer <> '' then
        begin
          ResultText := ResultText + LineBuffer + #13#10;
          LineBuffer := '';
        end;

        // Check if this is a paragraph break (larger gap)
        if YGap > LineHeight * 1.5 then
          ResultText := ResultText + #13#10; // Extra line for paragraph
      end;
    end;

    LineBuffer := LineBuffer + CurrentChar;
    PrevY := CurrentY;
  end;

  // Add final line
  if LineBuffer <> '' then
    ResultText := ResultText + LineBuffer;

  Result := ResultText;
end;
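
The first pass above takes the smallest non-zero change in Y as the line height, so a single superscript or slightly shifted glyph can distort the estimate. A possible alternative, sketched here rather than taken from the demo, is to use the median of the gaps that exceed a small noise floor. EstimateLineGap and its NoiseFloor parameter are hypothetical names. The sketch assumes the Y origins have already been copied into an array, for example from Pdf.CharacterOrigin[I].Y.

```pascal
program MedianLineGap;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Generics.Collections;

// Hypothetical helper: estimates the line spacing from the Y origins of
// consecutive characters. Gaps at or below NoiseFloor are treated as
// same-line jitter and ignored. Returns 0 if no line change was found.
function EstimateLineGap(const Ys: array of Double; NoiseFloor: Double): Double;
var
  Gaps: TList<Double>;
  I: Integer;
  Gap: Double;
begin
  Result := 0;
  Gaps := TList<Double>.Create;
  try
    for I := 1 to High(Ys) do
    begin
      Gap := Abs(Ys[I] - Ys[I - 1]);
      if Gap > NoiseFloor then
        Gaps.Add(Gap);
    end;
    if Gaps.Count > 0 then
    begin
      Gaps.Sort;
      Result := Gaps[Gaps.Count div 2]; // Upper median of the sorted gaps
    end;
  finally
    Gaps.Free;
  end;
end;

begin
  // One glyph shifted by 0.4, two ordinary line changes, one paragraph gap
  Writeln(EstimateLineGap([700.0, 700.4, 686.0, 672.0, 640.0], 1.0):0:1);
end.
```

With a median-based estimate, a gap of roughly one line spacing marks a line break and a clearly larger gap marks a paragraph break. The exact factors remain a tuning decision.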

Cleaning Extracted Text

Remove NUL characters and normalize the text:

function CleanAndFormatText(const RawText: string): UTF8String;
var
  I: Integer;
  CleanText: string;
begin
  CleanText := '';
  for I := 1 to Length(RawText) do
  begin
    // Skip NUL characters but keep all other characters
    if Ord(RawText[I]) <> 0 then
      CleanText := CleanText + RawText[I];
  end;
  Result := UTF8Encode(CleanText);
end;

Extracting Text from a Specific Region

Extract text from a rectangular region of the page:

procedure ExtractTextFromRegion;
var
  Pdf: TPdf;
  RegionText: string;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;
    
    // Extract text from specific rectangle
    // Parameters: Left, Top, Right, Bottom (in PDF coordinates)
    RegionText := Pdf.TextInRectangle(100, 700, 500, 600);
    
    ShowMessage('Text in region: ' + RegionText);
    
  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;
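
In the call above, Top (700) is larger than Bottom (600), because PDF coordinates normally start at the bottom-left corner of the page with Y increasing upward. If a region was measured from the top-left corner instead, as in most viewers and image tools, it has to be flipped first. The sketch below shows the arithmetic. TRegion and TopLeftToPdfRegion are hypothetical names for this illustration, and the page height in points is assumed to be supplied by the caller.

```pascal
program RegionCoords;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Hypothetical record for illustration; not a PDFium VCL type.
  TRegion = record
    Left, Top, Right, Bottom: Double;
  end;

// Converts a rectangle measured from the page's top-left corner (Y grows
// downward) into bottom-left-origin PDF coordinates (Y grows upward).
// PageHeight is the page height in points, supplied by the caller.
function TopLeftToPdfRegion(X, Y, Width, Height, PageHeight: Double): TRegion;
begin
  Result.Left := X;
  Result.Right := X + Width;
  Result.Top := PageHeight - Y;
  Result.Bottom := PageHeight - (Y + Height);
end;

var
  R: TRegion;
begin
  // A 400 x 100 box starting 92 points below the top of a 792-point page
  R := TopLeftToPdfRegion(100, 92, 400, 100, 792);
  Writeln(Format('%.0f, %.0f, %.0f, %.0f', [R.Left, R.Top, R.Right, R.Bottom]));
  // Prints: 100, 700, 500, 600
end.
```

The four resulting values correspond to the Left, Top, Right and Bottom arguments used in the TextInRectangle call above. Pages with a rotation or a shifted origin would need additional handling.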

Character-Level Access

For precise text analysis, access individual characters:

procedure AnalyzeCharacters;
var
  Pdf: TPdf;
  I: Integer;
  Ch: WideChar;
  Origin: TPdfPoint;
  CharRect: TPdfRectangle;
  FontSize: Double;
begin
  Pdf := TPdf.Create(nil);
  try
    Pdf.FileName := 'document.pdf';
    Pdf.Active := True;
    Pdf.PageNumber := 1;

    // Access each character
    for I := 0 to Pdf.CharacterCount - 1 do
    begin
      // Skip characters that were generated (e.g., hyphenation)
      if Pdf.CharacterGenerated[I] then
        Continue;

      // Skip characters that couldn't be mapped to Unicode
      if Pdf.CharacterMapError[I] then
        Continue;

      Ch := Pdf.Character[I];
      Origin := Pdf.CharacterOrigin[I];
      CharRect := Pdf.CharacterRectangle[I]; // Bounding box, for layout analysis
      FontSize := Pdf.FontSize[I];

      // Process character with position and size info
      Memo1.Lines.Add(Format('Char: %s at (%.2f, %.2f) size: %.2f',
        [Ch, Origin.X, Origin.Y, FontSize]));
    end;

  finally
    Pdf.Active := False;
    Pdf.Free;
  end;
end;

Finding Character at Screen Position

Looking up the character at a given position is useful for text selection and other mouse interaction:

function GetCharacterAtPosition(Pdf: TPdf; X, Y: Double): Integer;
begin
  // Look up the character index at (X, Y), with a tolerance of 5.0 along each axis
  Result := Pdf.CharacterIndexAtPos(X, Y, 5.0, 5.0);
end;

Handling Errors and Edge Cases

Wrap loading and per-page extraction in separate exception handlers, so that a file that fails to open is reported once and a single damaged page does not abort the whole run:

procedure TFormMain.SafeExtractText(StartPage, EndPage: Integer);
var
  I: Integer;
  PageText: string;
begin
  try
    Pdf.FileName := EditPdfFile.Text;
    Pdf.PageNumber := 0;
    Pdf.Active := True;
  except
    on E: Exception do
    begin
      LogMessage('Failed to load PDF: ' + E.Message);
      Exit;
    end;
  end;

  try
    for I := StartPage to EndPage do
    begin
      try
        Pdf.PageNumber := I;
        PageText := Pdf.Text;
        // Process text...
      except
        on E: Exception do
        begin
          // Log error but continue with next page
          LogMessage('Error on page ' + IntToStr(I) + ': ' + E.Message);
        end;
      end;
    end;
  finally
    Pdf.Active := False;
  end;
end;

Performance Considerations

Extract text page by page rather than loading everything into memory
Use streaming file output for large documents
Call Application.ProcessMessages in loops for UI responsiveness
Consider batch processing for multiple documents
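
For the batch-processing point, the outer loop might look like the sketch below. It only enumerates the PDF files in a folder and derives an output name for each one; the extraction itself would follow the page-by-page pattern shown earlier. OutputNameFor is a hypothetical helper, and writing each text file next to its source PDF is an assumption made for this illustration.

```pascal
program BatchOutputNames;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.IOUtils;

// Hypothetical helper: the text file is written next to the source PDF.
function OutputNameFor(const PdfPath: string): string;
begin
  Result := ChangeFileExt(PdfPath, '.txt');
end;

var
  FileName: string;
begin
  if (ParamCount < 1) or not TDirectory.Exists(ParamStr(1)) then
  begin
    Writeln('Usage: BatchOutputNames <folder>');
    Exit;
  end;

  for FileName in TDirectory.GetFiles(ParamStr(1), '*.pdf') do
  begin
    // A real batch job would open FileName with TPdf here and extract it
    // page by page into OutputNameFor(FileName).
    Writeln(FileName, ' -> ', OutputNameFor(FileName));
  end;
end.
```

Handling one document at a time and closing it before opening the next keeps memory use independent of the number of files.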

Conclusion

The Extract Text demo shows how PDFium VCL makes text extraction simple and reliable. Whether you need basic text extraction or advanced paragraph-aware processing, the component provides all the tools you need.

The character-level access allows for sophisticated text analysis, while the simple Text property handles most common use cases with a single line of code.

Start building your text extraction solution with PDFium VCL Component today.