Technical Article

Search and Replace Text in an Existing PDF with Delphi

HotPDF Component can search and replace text inside an existing PDF from Delphi and C++Builder. SearchLoadedPageText and SearchLoadedDocumentText locate every occurrence of a string with glyph-level precision, and ReplaceLoadedPageText and ReplaceLoadedDocumentText rewrite the matched bytes in place, provided every replacement character can be re-encoded through the original font. That condition is a physical constraint of the file, and this article treats it up front rather than hiding it in a footnote.

The request behind this feature is always mundane. A company renames itself and three thousand archived invoices still carry the old name. A contract template shipped with last year's expiry date. A product code was retired and every datasheet that mentions it needs the successor code instead. In a word processor each of these is a thirty-second job. In a PDF it is a genuinely hard problem, and understanding why makes the difference between using the API well and filing a bug report about behavior that the PDF specification itself dictates.

Why is replacing text in a PDF so hard?

Replacing text in a PDF is hard because a PDF page does not contain editable text: it contains positioned glyphs. Under the text-showing model of ISO 32000-1 §9.4, a content stream invokes operators such as Tj and TJ, which paint sequences of character codes at coordinates established by the text matrix. Those codes are not Unicode; they are indices into whatever encoding the page's font declares, and the mapping back to readable characters may live in a /ToUnicode CMap, an encoding /Differences array, or a CID mapping chain. There is no paragraph object, no text flow, and no guarantee that one visual word is even stored as one string.
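
To make the gap between character codes and readable characters concrete, here is a minimal sketch in plain Object Pascal. The four-entry table and the DecodeToyCodes function are invented for illustration and stand in for whatever mapping a real font declares; nothing in it is HotPDF API.

```pascal
program DecodeCodes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

const
  // Toy subset font (an assumption for this sketch): codes 1..4 select
  // these four characters. A real font declares the mapping through
  // /ToUnicode, /Differences, or a CID chain.
  ToyToUnicode: array[1..4] of Char = ('H', 'e', 'l', 'o');

// Turns the code bytes of a string operand into readable text.
function DecodeToyCodes(const Codes: array of Byte): string;
var
  I: Integer;
begin
  Result := '';
  for I := Low(Codes) to High(Codes) do
    if (Codes[I] >= 1) and (Codes[I] <= 4) then
      Result := Result + ToyToUnicode[Codes[I]]
    else
      Result := Result + '?';  // no mapping: the code cannot be read back
end;

begin
  // The bytes a Tj operand might carry for "Hello" in this toy font.
  WriteLn(DecodeToyCodes([1, 2, 3, 3, 4]));
end.
```

The same five bytes would decode to something else entirely under a different font, which is why searching the raw stream for the ASCII bytes of a word is not reliable.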

Replacement adds a second layer of difficulty on top of decoding: you must know exactly which bytes of the original stream produced each glyph, so you can splice new bytes into precisely that span and nothing else. A text extractor can afford to throw the byte positions away once it has the Unicode out. A replacer cannot. That is why HotPDF split the work across two releases: v2.251.0 built the offset-tracking and search layer, and v2.252.0 built the rewriting layer on top of it.

Finding text: glyph-level search with byte-offset tracking

HotPDF's SearchLoadedDocumentText finds every occurrence of a needle by matching against the decoded Unicode glyph sequence of each page, not against raw stream bytes, so a hit is a hit regardless of how the font encoded it. The infrastructure underneath was introduced in v2.251.0: the content-stream tokenizer records a StartOfs/EndOfs byte span for every string operand, including its ( ) or < > delimiters, and every decoded glyph carries a TokenIndex/ItemIndex/ByteOffset triple pointing back to the exact operand, TJ-array item, and code unit that produced it. The same glyph interpreter powers the extraction API described in extracting text from a loaded PDF in Delphi; search simply keeps the provenance that extraction discards.
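
The idea of matching on decoded glyphs while keeping their provenance can be sketched as a toy model. TToyGlyph, HelloGlyphs, and FindInGlyphs are hypothetical names written for this illustration; the field names echo the triple described above, but this is not HotPDF's declaration, and the meaning given to ByteOffset is an assumption.

```pascal
program GlyphProvenance;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical record, not HotPDF's own type.
  TToyGlyph = record
    Ch: Char;             // decoded character
    TokenIndex: Integer;  // which string operand or TJ array
    ItemIndex: Integer;   // which item inside a TJ array
    ByteOffset: Integer;  // assumed here: offset of the code within the item
  end;
  TToyGlyphArray = array of TToyGlyph;

function MakeGlyph(Ch: Char; Token, Item, Ofs: Integer): TToyGlyph;
begin
  Result.Ch := Ch;
  Result.TokenIndex := Token;
  Result.ItemIndex := Item;
  Result.ByteOffset := Ofs;
end;

// Glyphs as they might decode from: [(He)(llo)] TJ
function HelloGlyphs: TToyGlyphArray;
begin
  SetLength(Result, 5);
  Result[0] := MakeGlyph('H', 0, 0, 0);
  Result[1] := MakeGlyph('e', 0, 0, 1);
  Result[2] := MakeGlyph('l', 0, 1, 0);
  Result[3] := MakeGlyph('l', 0, 1, 1);
  Result[4] := MakeGlyph('o', 0, 1, 2);
end;

// Matches against decoded characters, never raw bytes.
// Returns the index of the first glyph of the first hit, or -1.
function FindInGlyphs(const Glyphs: TToyGlyphArray;
  const Needle: string): Integer;
var
  I, J: Integer;
  Hit: Boolean;
begin
  Result := -1;
  if Needle = '' then
    Exit;
  for I := 0 to Length(Glyphs) - Length(Needle) do
  begin
    Hit := True;
    for J := 1 to Length(Needle) do
      if Glyphs[I + J - 1].Ch <> Needle[J] then
      begin
        Hit := False;
        Break;
      end;
    if Hit then
    begin
      Result := I;
      Exit;
    end;
  end;
end;

var
  G: TToyGlyphArray;
  P: Integer;
begin
  G := HelloGlyphs;
  P := FindInGlyphs(G, 'llo');
  if P >= 0 then
    WriteLn(Format('hit at glyph %d: token %d, item %d, byte %d',
      [P, G[P].TokenIndex, G[P].ItemIndex, G[P].ByteOffset]));
  // The split between the two TJ items does not hide the whole word.
  WriteLn(FindInGlyphs(G, 'Hello'));
end.
```

Because each glyph remembers where it came from, a hit in the decoded text can be traced back to a position in the stream, which a plain extractor has no reason to retain.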

Each match comes back as a THPDFTextMatch record carrying the page index, the inclusive glyph range, the user-space X/Y origin and width of the hit, the source token and item index, and the matched text itself. That is enough to drive a highlight overlay, a review UI, or the replace step. A search that finds nothing returns an empty array rather than failing, so the calling pattern stays simple.

var
  Pdf: THotPDF;
  Matches: THPDFTextMatchArray;
  I: Integer;
begin
  Pdf := THotPDF.Create(nil);
  try
    // Continue only if the load reports a positive result.
    if Pdf.LoadFromFile('invoices-2025.pdf') > 0 then
    begin
      // Second argument False: case-insensitive search (ASCII folding only).
      if Pdf.SearchLoadedDocumentText('Acme Corp', False, Matches) then
        // One line per hit: page, user-space origin, and matched text.
        for I := 0 to Length(Matches) - 1 do
          WriteLn(Format('page %d at (%.1f, %.1f): "%s"',
            [Matches[I].PageIndex, Matches[I].X, Matches[I].Y,
             Matches[I].Text]));
    end;
  finally
    Pdf.Free;
  end;
end;

One deliberate design choice deserves a note. When CaseSensitive is False, the comparison folds case for ASCII characters only. Full Unicode case folding behaves differently across the Delphi 5 through XE toolchains HotPDF supports, and a search API that finds different matches depending on which compiler built your application is worse than one with a documented, predictable limit. For Latin business text such as names, codes, and dates, ASCII folding covers the practical cases.
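
ASCII-only folding is simple enough to show in full. The sketch below is an illustration of the technique, not HotPDF's code; FoldAscii and SameTextAscii are names made up for it.

```pascal
program AsciiFold;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Lowercases A..Z and leaves every other character untouched.
function FoldAscii(const S: string): string;
var
  I: Integer;
begin
  Result := S;
  for I := 1 to Length(Result) do
    if (Result[I] >= 'A') and (Result[I] <= 'Z') then
      Result[I] := Chr(Ord(Result[I]) + 32);
end;

function SameTextAscii(const A, B: string): Boolean;
begin
  Result := FoldAscii(A) = FoldAscii(B);
end;

begin
  // Plain ASCII: the two spellings compare equal.
  WriteLn(SameTextAscii('ACME Corp', 'acme corp'));
  // Codes $C9 and $E9 are the upper and lower case of the same accented
  // letter in Latin-1 and Unicode, but they lie outside ASCII, so they
  // are not folded and the comparison fails.
  WriteLn(SameTextAscii('CAF'#$C9, 'caf'#$E9));
end.
```

The function gives the same answer on every compiler because it never consults locale or Unicode tables, which is the predictability the design choice is buying.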

Replacing text: reverse encoding and surgical splicing

ReplaceLoadedDocumentText, added in HotPDF v2.252.0, rewrites every occurrence of a needle by running the decoding machinery backwards. The HPDFEncodeUnicode function is the inverse of the char-code decoder: it walks the same strategy chain in reverse (/ToUnicode bfchar and bfrange lookup, encoding-stream CID mapping, Type0 identity mappings, and the predefined WinAnsi and MacRoman tables) to turn each replacement character back into the char-code bytes the original font expects. The re-encoded bytes are then serialized into a well-formed string literal or hex string, mirroring the tokenizer's own escaping rules so that a parse → re-serialize round trip is stable.
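
A toy version of the two steps, reverse lookup and serialization, might look like the following. The eight-character table, EncodeToy, PdfHex, and PdfLiteral are all hypothetical and unrelated to the real signature of HPDFEncodeUnicode, which this article does not show. The escaping in PdfLiteral is one conservative policy that the PDF string syntax permits (escape backslash, both parentheses, CR, and LF); HotPDF's exact rules may differ.

```pascal
program ReverseEncode;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Toy subset font (assumption): code N selects ToyChars[N]. No '6'.
  ToyChars: AnsiString = '0123459-';

// Reverse lookup, character to code. Fails if any character is unmapped.
function EncodeToy(const Text: AnsiString; out Codes: AnsiString): Boolean;
var
  I, J, Code: Integer;
begin
  Result := False;
  Codes := '';
  for I := 1 to Length(Text) do
  begin
    Code := 0;
    for J := 1 to Length(ToyChars) do
      if ToyChars[J] = Text[I] then
      begin
        Code := J;
        Break;
      end;
    if Code = 0 then
    begin
      Codes := '';  // never hand back a partial encoding
      Exit;
    end;
    Codes := Codes + AnsiChar(Code);
  end;
  Result := True;
end;

// Serializes code bytes as a PDF hex string.
function PdfHex(const Bytes: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '<';
  for I := 1 to Length(Bytes) do
    Result := Result + AnsiString(IntToHex(Ord(Bytes[I]), 2));
  Result := Result + '>';
end;

// Serializes code bytes as a PDF literal string with escapes.
function PdfLiteral(const Bytes: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '(';
  for I := 1 to Length(Bytes) do
    case Bytes[I] of
      '\', '(', ')': Result := Result + '\' + Bytes[I];
      #10: Result := Result + '\n';
      #13: Result := Result + '\r';
    else
      Result := Result + Bytes[I];
    end;
  Result := Result + ')';
end;

var
  Codes: AnsiString;
begin
  if EncodeToy('2025-12-31', Codes) then
    WriteLn('encoded: ', PdfHex(Codes));
  if not EncodeToy('2026-12-31', Codes) then
    WriteLn('cannot encode: a character is missing from the toy subset');
  WriteLn(PdfLiteral('a(b)\c'));
end.
```

The second call fails for exactly one reason: the toy table has no entry for 6, so there is no code to write.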

The splice itself is surgical rather than wholesale. Only the code-byte range covered by the match is replaced inside the string operand; non-matched bytes in the same operand, the whitespace between tokens, and every surrounding operator are preserved verbatim, byte for byte. Replacing bca inside abcabc yields a + replacement + bc, not a clobbered operand. Replacements may be shorter or longer than the needle (the literal is reserialized and the stream's /Length refreshed), and each /Contents stream of a multi-stream page is processed in isolation so the page stays well-formed.
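
The abcabc case can be reproduced with an ordinary string splice. SpliceSpan is a hypothetical helper written for this sketch, and ordinary characters stand in for code bytes.

```pascal
program SpliceDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Replaces Count characters starting at 1-based position First and
// leaves everything before and after the span untouched.
function SpliceSpan(const Operand: string; First, Count: Integer;
  const Replacement: string): string;
begin
  Result := Copy(Operand, 1, First - 1) + Replacement +
    Copy(Operand, First + Count, MaxInt);
end;

var
  Operand, Edited: string;
  P: Integer;
begin
  Operand := 'abcabc';
  P := Pos('bca', Operand);  // the first match starts at position 2
  Edited := SpliceSpan(Operand, P, Length('bca'), 'XY');
  WriteLn(Edited);           // prints aXYbc
  // In this toy, with no escaping involved, the operand shrank by one
  // byte, so the stream's /Length would shrink by one as well.
  WriteLn(Length(Edited) - Length(Operand));  // prints -1
end.
```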

var
  Pdf: THotPDF;
  ReplaceCount: Integer;
begin
  Pdf := THotPDF.Create(nil);
  try
    // Continue only if the load reports a positive result.
    if Pdf.LoadFromFile('contract-draft.pdf') > 0 then
    begin
      // Third argument True: case-sensitive match.
      if Pdf.ReplaceLoadedDocumentText('2025-12-31', '2026-12-31',
        True, ReplaceCount) then
        WriteLn(Format('%d operand rewrites performed', [ReplaceCount]));
      // The output file is written even when nothing was replaced.
      Pdf.SaveLoadedDocument('contract-final.pdf');
    end;
  finally
    Pdf.Free;
  end;
end;

Note what the API does not do: it does not re-typeset the page. PDF has no reflow, so a replacement that is visually wider than the original will simply occupy more horizontal space and may crowd whatever was painted to its right. Same-length or near-length substitutions (dates, version strings, part numbers, name corrections) are the sweet spot. Wholesale rewording belongs in the source document, not the PDF.
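
Whether a substitution will crowd its neighbours can be estimated before it is made, since the painted width of a run is roughly the sum of its glyph advances (in thousandths of the font size) scaled by the font size. The sketch below uses invented advance widths and ignores character spacing, word spacing, and horizontal scaling; ToyAdvance and ToyTextWidth are hypothetical names, and a real check would read the widths the font itself declares.

```pascal
program WidthCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical advance widths in 1/1000 of the font size.
function ToyAdvance(Ch: Char): Integer;
begin
  case Ch of
    'i', 'l': Result := 222;
    ' ':      Result := 278;
    'm':      Result := 833;
    'W':      Result := 944;
  else
    Result := 556;
  end;
end;

// Approximate painted width of S, in the same units as FontSize.
function ToyTextWidth(const S: string; FontSize: Double): Double;
var
  I, Sum: Integer;
begin
  Sum := 0;
  for I := 1 to Length(S) do
    Inc(Sum, ToyAdvance(S[I]));
  Result := Sum / 1000 * FontSize;
end;

begin
  // Same character count, very different widths.
  WriteLn(Format('%.2f', [ToyTextWidth('ill', 10)]));
  WriteLn(Format('%.2f', [ToyTextWidth('mmm', 10)]));
  // Digits share one width in this toy table, so a date swap is neutral.
  WriteLn(ToyTextWidth('2026-12-31', 10) = ToyTextWidth('2025-12-31', 10));
end.
```

Comparing the estimate for the replacement against the width reported for the match gives an early warning that a longer name will run into the next column.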

Why can't you replace text with characters the font subset never included?

You cannot replace text with a character the embedded font subset never included, because the byte sequence that would select that character simply does not exist in the font's mapping tables. When a PDF producer embeds a subset font, its /ToUnicode CMap and encoding structures cover only the glyphs the original document actually used. HPDFEncodeUnicode can only reverse a mapping that is present: if the document never contained the letter E in that font, there is no char code for E to reverse to. This is a physical property of the file, not a limitation of any particular library: no tool can conjure a glyph mapping that was never embedded.

HotPDF handles the failure conservatively. If any single character of the replacement cannot be re-encoded, that entire needle occurrence is skipped: no exception, no partial garbage text, and the occurrence is simply not counted in ReplaceCount. The practical consequence: check ReplaceCount against the match count from a prior search, and treat a shortfall as a signal. In the date example above, the digit 6 must appear somewhere in the document's text in that same font for the rewrite to succeed, which is likely in a contract but never guaranteed in general. When the characters you need are simply not available and the goal is to remove sensitive text rather than reword it, true content removal is the better tool anyway; see redacting and restructuring loaded PDFs in Delphi for that path.

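// Assumes Pdf is a THotPDF instance with a document already loaded.
var
  Matches: THPDFTextMatchArray;
  Expected, Replaced: Integer;
begin
  // Count the occurrences first, with the same case sensitivity.
  Pdf.SearchLoadedDocumentText('Acme Corp', True, Matches);
  Expected := Length(Matches);
  Pdf.ReplaceLoadedDocumentText('Acme Corp', 'Apex Corp', True, Replaced);
  // A shortfall means some occurrences were skipped, not that the call failed.
  if Replaced < Expected then
    WriteLn(Format('%d occurrence(s) skipped: characters missing ' +
      'from the font subset, or match spans multiple operands',
      [Expected - Replaced]));
end;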

The second skip condition in that message is the other documented boundary: a needle that spans multiple string operands (Hello split across the items of [(He)(llo)] TJ, for instance) is found by search, because search matches the decoded glyph sequence, but is skipped by replace, because rewriting across operand boundaries would require merging adjacent byte spans. Search-then-verify makes both limits visible instead of silent.
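
The operand-boundary rule reduces to a one-line test once each glyph knows which item produced it. The sketch below is a toy model: FitsOneItem is a hypothetical helper, the glyphs are assumed to come from a single TJ array, and item indices are assumed to be non-decreasing along the glyph sequence, so comparing the first and last glyph of a match is sufficient.

```pascal
program SpanCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// ItemOf[I] is the TJ item that produced glyph I (zero-based).
// True when the glyph range First .. First + Count - 1 lies in one item.
function FitsOneItem(const ItemOf: array of Integer;
  First, Count: Integer): Boolean;
begin
  Result := (Count > 0) and (First >= 0) and
    (First + Count - 1 <= High(ItemOf)) and
    (ItemOf[First] = ItemOf[First + Count - 1]);
end;

begin
  // Item index per glyph for "Hello" painted as [(He)(llo)] TJ.
  // "Hello" covers glyphs 0..4 and crosses from item 0 into item 1.
  WriteLn(FitsOneItem([0, 0, 1, 1, 1], 0, 5));
  // "llo" covers glyphs 2..4 and stays inside item 1.
  WriteLn(FitsOneItem([0, 0, 1, 1, 1], 2, 3));
end.
```

A match that fails this kind of test is still worth reporting to the user, since it marks an occurrence that needs attention in the source document.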

What changes in the file when you save?

A replaced /Contents stream is saved uncompressed. FlateDecode-compressed streams are decompressed for editing, and when HotPDF writes the rebuilt bytes it drops the stream's /Filter entry and refreshes /Length rather than recompressing. The resulting PDF is fully valid and renders normally in mainstream viewers; the trade-off is a larger file for each edited stream. For a batch pipeline that processes thousands of documents, budget for that growth or run a separate compression pass downstream. How rewritten objects interact with the document's cross-reference structure on save is its own topic, covered in object streams and incremental updates in HotPDF.

Everything else about the file is left alone. Untouched streams keep their compression, fonts and images are not rewritten, and the operand-level splice means even the edited streams differ from the original only where a match landed. That conservatism is deliberate: the more of a loaded document a library rewrites, the more opportunities it has to break a producer quirk it did not anticipate.

Text search and replace joins extraction, redaction, and page rendering in HotPDF's loaded-document toolset, all driven by the same content-stream interpreter and available from Delphi 5 through the current RAD Studio releases without external dependencies. The full API reference and trial download are on the HotPDF Component product page.