PDF Fonts and Text: Why Glyphs Turn Into Boxes

A PDF that looks perfect on your machine and renders as a row of empty boxes on someone else's is one of the most common font defects in document software, and it almost never means the text is wrong. The characters are intact and the encoding is fine; the glyphs are simply not there. What changed between the two machines is which fonts the operating system had installed, and the gap between a portable file and a fragile one is one decision made when the page was written: whether the font traveled inside the PDF or was assumed present on the far end.

Understanding why that happens, and why a separate failure produces searchable-looking text that copies out as gibberish, means looking at how PDF stores text. It does not store sentences. It stores glyph codes plus a font program plus tables that map one to the other, and every rendering or extraction bug lives in a gap between those three. What follows is a tour of that machinery, grounded in ISO 32000, with the Delphi calls that control it where they matter.

Characters, codes, and glyphs are three different things

The vocabulary trips people up because everyday speech collapses three distinct ideas into the word "letter." A character is an abstract unit of writing, the idea of capital A, identified in Unicode as U+0041. A glyph is a drawn shape, the curve-and-stem outline a particular font uses to depict that character. Between them sits the code: the byte or bytes in the content stream that tell the viewer which glyph in the current font to paint.

PDF works in codes. When a content stream shows a string, those bytes are indices into the active font, not Unicode. The font's encoding decides that a code of 65 means "draw the glyph filed under 65," and nothing in that operation knows the result looks like an A to a human. That is what makes PDF render identically everywhere it can find the glyphs, and it is also why extraction is a separate problem from display: drawing needs only code-to-glyph, reading needs code-to-Unicode, and those are two different tables that can disagree or go missing independently.
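A minimal sketch of that split, assuming a made-up three-glyph font: the function names and the code assignments below are hypothetical and belong to no real font or library. The point is only that drawing and extraction consult separate tables, so one can succeed where the other fails.

```pascal
program CodeVsUnicode;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Table 1, code -> glyph: all a renderer needs.
function GlyphForCode(Code: Byte): string;
begin
  case Code of
    1: Result := 'H';
    2: Result := 'i';
    3: Result := 'exclam';
  else
    Result := '.notdef';  // the missing-glyph shape, usually drawn as a box
  end;
end;

// Table 2, code -> Unicode: what copy and search need.
// Code 3 is left out on purpose; -1 means "no mapping".
function UnicodeForCode(Code: Byte): Integer;
begin
  case Code of
    1: Result := $0048;  // H
    2: Result := $0069;  // i
  else
    Result := -1;
  end;
end;

const
  Shown: array[0..2] of Byte = (1, 2, 3);  // bytes of the string in the content stream
var
  I, U: Integer;
begin
  for I := Low(Shown) to High(Shown) do
  begin
    U := UnicodeForCode(Shown[I]);
    if U >= 0 then
      WriteLn('code ', Shown[I], ' draws ', GlyphForCode(Shown[I]),
        ', extracts U+', IntToHex(U, 4))
    else
      WriteLn('code ', Shown[I], ' draws ', GlyphForCode(Shown[I]),
        ', extracts nothing');
  end;
end.
```

Code 3 draws an exclamation mark but has nothing to hand the clipboard, which is the copy-paste failure in miniature.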

The font types you will actually meet

ISO 32000 defines several font dictionary types, and in practice a document you receive or generate uses one of three. Knowing which one you are looking at explains most of what can go wrong.

Type 1 is Adobe's original PostScript outline format, built from cubic Bézier curves. The fourteen standard fonts that ISO 32000-1 expects every conforming reader to make available, the Helvetica, Times, and Courier families plus Symbol and ZapfDingbats, are Type 1, and a font dictionary that names one of them can legally omit the font program. That is the one case where leaving a font unembedded is safe by specification rather than by luck. For any other Type 1 face the program has to be embedded or the viewer substitutes something, usually a metrically similar but visibly different font.

TrueType uses quadratic curves and came from the Apple and Microsoft world. It is what most system fonts are, and what you will embed most often. A simple TrueType font in PDF is limited to single-byte codes, so one such font can address at most 256 glyphs at a time. That cap is the structural reason CJK and other large scripts cannot ride on a simple font.

Type 0, the composite or CID-keyed font, is the answer to that limit. It uses multi-byte codes and a CMap to route them through a descendant CIDFont, whose outlines are themselves either TrueType or CFF/Type 1. This is the only font type that can address thousands of glyphs through a single font, so a PDF holding Chinese, Japanese, Korean, or a wide multilingual mix is almost certainly using Type 0 whether the author thought about it or not. The trade is complexity: more moving parts, more of which have to be correct for both rendering and extraction.
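To make "multi-byte codes" concrete, here is a small sketch of how two-byte codes end up in a content stream as a hexadecimal string. It assumes the common Identity-H arrangement, where every code is exactly two bytes and is used directly as the CID. CodesToHexString and the sample code values are illustrative, not part of HotPDF.

```pascal
program TwoByteCodes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Spells each 16-bit code as four hex digits, high byte first, inside < >,
// which is the PDF hexadecimal string form.
function CodesToHexString(const Codes: array of Word): string;
var
  I: Integer;
begin
  Result := '<';
  for I := Low(Codes) to High(Codes) do
    Result := Result + IntToHex(Integer(Codes[I]), 4);
  Result := Result + '>';
end;

begin
  // Hypothetical codes; a real generator takes them from the font it embeds.
  WriteLn(CodesToHexString([$0003, $4F60, $597D]), ' Tj');
end.
```

Three codes occupy six bytes here, where a simple font would have spent three bytes and been limited to 256 distinct values.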

Figure: One TrueType font rendered at 12, 18, 24, and 36 points in a PDF, showing that a single embedded outline scales to any size.

One detail behind that picture drives file size. A font is a library of outlines, not fixed-size bitmaps, so the same embedded program serves every point size on the page. Scaling is a transform applied at draw time, which is why a heading and its body text share one embedded face and why embedding cost is per font, not per size.
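The arithmetic behind that scaling is short enough to show. For Type 1 and TrueType fonts, PDF records glyph widths in thousandths of a text space unit, so the advance at a given size is one multiplication. The width of 556 below is an illustrative value, and the sketch ignores character spacing and horizontal scaling.

```pascal
program GlyphScale;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Width is in 1/1000 of a text space unit; the result is in points.
function AdvanceInPoints(GlyphWidth: Integer; FontSize: Double): Double;
begin
  Result := GlyphWidth / 1000 * FontSize;
end;

const
  Sizes: array[0..3] of Integer = (12, 18, 24, 36);
  SampleWidth = 556;  // illustrative width for one glyph
var
  I: Integer;
begin
  // Same outline, same stored width, four different sizes.
  for I := Low(Sizes) to High(Sizes) do
    WriteLn(Sizes[I], ' pt -> advance ',
      AdvanceInPoints(SampleWidth, Sizes[I]):0:3, ' pt');
end.
```

Nothing in the font program changes between the four lines of output; only the multiplier does.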

Embedding is the difference between portable and fragile

Embedding means the font program, the actual outline data, is written into the PDF as a stream. A reader on a machine that has never heard of your font reads those outlines straight from the file and draws exact glyphs. Skip embedding and you are betting the destination has a font of the same name; when it does not, the viewer falls back to a substitute. For the standard fourteen that substitution is defined and benign. For everything else it ranges from a near-miss in a different typeface to the empty-box result when no substitute covers the script at all.

With HotPDF the control is a single property, set before the document opens. FontEmbedding tells the library to pack the faces it draws with into the file:

procedure WriteReport;
var
  Pdf: THotPDF;
begin
  Pdf := THotPDF.Create(nil);
  try
    Pdf.FileName := 'report.pdf';
    Pdf.Compression := cmFlateDecode;
    Pdf.FontEmbedding := True;          // set before BeginDoc: outlines travel inside the file
    Pdf.BeginDoc;                       // document structure is committed here
    Pdf.CurrentPage.SetFont('Calibri', [], 11);
    Pdf.CurrentPage.TextOut(72, 760, 0, 'This renders the same on a machine without Calibri.');
    Pdf.EndDoc;
  finally
    Pdf.Free;                           // released even if a call above raises
  end;
end;

The ordering is not cosmetic. BeginDoc is where HotPDF commits the document structure, so FontEmbedding has to be true before that call. Assign it afterward and there is no error, no warning, just a file that quietly went out without its fonts. That is the worst kind of bug: it passes every test on the developer's machine, where the font happens to be installed, and only surfaces on a customer's, where it is not.

Embedding is also where licensing meets engineering. A TrueType or OpenType font program carries flags, the fsType field of its OS/2 table, describing whether it may be embedded freely, for preview and print only, or not at all. Honoring those flags is your responsibility, not the renderer's, and "it worked" is not the same as "it was allowed."
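For TrueType and OpenType fonts those flags are the fsType field of the OS/2 table. The sketch below decodes an fsType value that has already been read from the font file. Reading the table is left out, and the type and function names are hypothetical. When several permission bits are set, it treats the least restrictive one as governing, which is how the OpenType specification describes older table versions.

```pascal
program EmbeddingFlags;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TEmbedLevel = (elInstallable, elEditable, elPreviewPrint, elRestricted);

const
  LevelName: array[TEmbedLevel] of string =
    ('installable', 'editable', 'preview and print', 'restricted license');

// Bits $0002, $0004, and $0008 carry the usage permission.
function EmbedLevel(FsType: Word): TEmbedLevel;
begin
  if (FsType and $000E) = 0 then
    Result := elInstallable
  else if (FsType and $0008) <> 0 then
    Result := elEditable
  else if (FsType and $0004) <> 0 then
    Result := elPreviewPrint
  else
    Result := elRestricted;  // only $0002 is set: no embedding without permission
end;

// Bit $0200 means only bitmaps may be embedded, so outlines are off limits.
function MayEmbedOutlines(FsType: Word): Boolean;
begin
  Result := (EmbedLevel(FsType) <> elRestricted) and ((FsType and $0200) = 0);
end;

// Bit $0100 forbids subsetting even when embedding is allowed.
function MaySubset(FsType: Word): Boolean;
begin
  Result := MayEmbedOutlines(FsType) and ((FsType and $0100) = 0);
end;

const
  Samples: array[0..4] of Word = ($0000, $0002, $0004, $0008, $0104);
var
  I: Integer;
begin
  for I := Low(Samples) to High(Samples) do
    WriteLn('fsType $', IntToHex(Integer(Samples[I]), 4), ': ',
      LevelName[EmbedLevel(Samples[I])],
      ', embed outlines: ', MayEmbedOutlines(Samples[I]),
      ', subset: ', MaySubset(Samples[I]));
end.
```

A check like this belongs before the document is generated; the flags describe what the license intends, and the license text itself remains the authority.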

Subsetting: embed only the glyphs you used

Full embedding writes the entire font program into the file. A large CJK TrueType face can run to several megabytes or more, and embedding it whole to show a dozen characters is wasteful in a way that compounds across a multi-page document. Subsetting solves this by writing only the glyphs the document references, then prefixing the font name with a tag of six uppercase letters and a plus sign, the ABCDEF+Calibri form in any subsetted PDF's font listing, so a reader never confuses the partial face with a full system font of the same name.
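The tag itself carries no meaning: ISO 32000 asks only for exactly six uppercase letters, with different subsets in the same file getting different tags. A sketch of one way to produce and recognize such names follows. The counter-based scheme and all three function names are assumptions for illustration, not how HotPDF or any particular generator chooses its tags.

```pascal
program SubsetTags;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Turns a counter into six letters A..Z (base 26, lowest digit first),
// so distinct counters below 26^6 give distinct tags.
function SubsetTag(Seed: Cardinal): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to 6 do
  begin
    Result := Result + Char(Ord('A') + Integer(Seed mod 26));
    Seed := Seed div 26;
  end;
end;

function SubsetFontName(Seed: Cardinal; const BaseName: string): string;
begin
  Result := SubsetTag(Seed) + '+' + BaseName;
end;

// True when Name has the TAGXYZ+BaseName shape of a subsetted font.
function IsSubsetName(const Name: string): Boolean;
var
  I: Integer;
begin
  Result := (Length(Name) > 7) and (Name[7] = '+');
  if Result then
    for I := 1 to 6 do
      if (Name[I] < 'A') or (Name[I] > 'Z') then
      begin
        Result := False;
        Break;
      end;
end;

begin
  WriteLn(SubsetFontName(0, 'Calibri'));
  WriteLn(SubsetFontName(27, 'Calibri'));
  WriteLn(IsSubsetName('ABCDEF+Calibri'));
  WriteLn(IsSubsetName('Calibri'));
end.
```

The recognizer is the more useful half when reading other people's files: a name that passes it tells you the embedded face is partial before you try to add text to it.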

For most generated documents subsetting is the right default. It keeps file size proportional to content rather than to the source font, which matters most for the large multilingual fonts that would otherwise dominate the file. The one caveat is that a subset only contains what was used at creation time. If a downstream process tries to add text to a subsetted font later, the glyphs it needs may not be in the file, a real constraint on incremental editing of someone else's PDF.

Unicode fonts and the CJK box problem

When the text is not plain Latin the simple-font path runs out, and the fix is to register a Unicode-capable font explicitly and let HotPDF build a Type 0 font from it. RegisterUnicodeTTF loads a TrueType file by path; after that the registered name is usable in SetFont like any other:

Pdf.FontEmbedding := True;
Pdf.RegisterUnicodeTTF('C:\Fonts\NotoSansCJKsc-Regular.ttf');
Pdf.BeginDoc;
Pdf.CurrentPage.SetFont('NotoSansCJKsc-Regular', [], 14);
Pdf.CurrentPage.TextOut(72, 720, 0, '你好，世界 こんにちは 안녕하세요');
Pdf.EndDoc;

Two things make or break this. The font has to cover the scripts in the string: a Latin-only TrueType will not grow Chinese glyphs because you asked it to, and the result is empty boxes again, this time because the glyph genuinely does not exist in that face. And embedding has to stay on, because a Type 0 font assembled from a registered TTF is meaningless to a reader that cannot find the outlines. For mixed content the durable choice is a broad-coverage face, the Noto and Arial Unicode MS families being the usual answers, embedded and subsetted.

Right-to-left and complex scripts add a shaping layer on top of coverage. HotPDF exposes RtLTextOut for Arabic and Hebrew, which handles the directional reordering so you pass logical order and let the library lay it out. Getting Arabic right is coverage plus shaping plus direction, three separate things, and a box there can mean any one of them failed.

The ToUnicode table: where copy-paste lives

Everything above concerns drawing. Extraction is the mirror image and fails for its own reasons. A viewer renders a page using the font's code-to-glyph mapping, but when a user selects text and copies it, the viewer needs to turn those same codes back into Unicode. That reverse mapping is the ToUnicode CMap, an optional stream attached to the font.

When it is present and correct, copied text comes out as the right characters. When it is absent or wrong, or the font was subsetted with custom glyph codes and no ToUnicode was written, the page looks perfect and the clipboard fills with garbage: glyph codes read as if they were Unicode, which for a custom-encoded subset they are not. This is why a scanned document with an OCR text layer can be searchable while a born-digital PDF from a careless generator is not. Rendering and extraction draw on different tables, so a file can satisfy one and fail the other. If extraction matters for your output, treat a correct ToUnicode map as a requirement, and verify it by copying text out of a sample rather than trusting it is there.
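For reference, this is roughly what a ToUnicode CMap contains. The sketch builds the text of one for a font with two-byte codes and BMP-only targets; the record type, function name, and the two sample mappings are hypothetical. A real generator would write the result into a stream object and point the font's ToUnicode entry at it, and would also need to handle characters outside the BMP.

```pascal
program ToUnicodeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  TCodeToUnicode = record
    Code: Word;     // 2-byte code as it appears in the content stream
    Unicode: Word;  // target character; this sketch handles the BMP only
  end;

function BuildToUnicodeCMap(const Map: array of TCodeToUnicode): string;
var
  Lines: TStringList;
  I, J, N: Integer;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('/CIDInit /ProcSet findresource begin');
    Lines.Add('12 dict begin');
    Lines.Add('begincmap');
    Lines.Add('/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def');
    Lines.Add('/CMapName /Adobe-Identity-UCS def');
    Lines.Add('/CMapType 2 def');
    Lines.Add('1 begincodespacerange');
    Lines.Add('<0000> <FFFF>');
    Lines.Add('endcodespacerange');
    I := 0;
    while I < Length(Map) do
    begin
      N := Length(Map) - I;
      if N > 100 then
        N := 100;  // a bfchar section holds at most 100 entries
      Lines.Add(IntToStr(N) + ' beginbfchar');
      for J := I to I + N - 1 do
        Lines.Add('<' + IntToHex(Integer(Map[J].Code), 4) + '> <' +
          IntToHex(Integer(Map[J].Unicode), 4) + '>');
      Lines.Add('endbfchar');
      Inc(I, N);
    end;
    Lines.Add('endcmap');
    Lines.Add('CMapName currentdict /CMap defineresource pop');
    Lines.Add('end');
    Lines.Add('end');
    Result := Lines.Text;
  finally
    Lines.Free;
  end;
end;

var
  Map: array[0..1] of TCodeToUnicode;
begin
  Map[0].Code := $0001; Map[0].Unicode := $0048;  // hypothetical subset code 1 -> 'H'
  Map[1].Code := $0002; Map[1].Unicode := $0069;  // hypothetical subset code 2 -> 'i'
  Write(BuildToUnicodeCMap(Map));
end.
```

Each bfchar line is one answer to the question a viewer asks on copy: given this code, which character was meant? Leave a code out and that glyph still draws but cannot be extracted.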

How to diagnose a font bug fast

The failure mode tells you where to look. Empty boxes on another machine almost always mean a font that was not embedded, so check embedding first and glyph coverage second. Boxes that appear even on your own machine point at coverage: the font does not contain that script, regardless of embedding. Text that renders correctly but copies out as nonsense is a ToUnicode problem, not a rendering one, and fiddling with fonts or embedding will not fix it because the drawing was never broken. To read a finished file, open it in Acrobat and look at Document Properties, Fonts: a healthy entry shows the type, says Embedded or Embedded Subset, and names the encoding. A font that should be embedded and is not announces itself there before a customer does.
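The triage above is small enough to write down as a lookup, which can be handy in a support checklist or a test harness. The symptom names and wording below are an illustrative encoding of this section, nothing more.

```pascal
program FontTriage;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TSymptom = (
    syBoxesOnOtherMachine,  // fine locally, boxes elsewhere
    syBoxesEverywhere,      // boxes even on the authoring machine
    syCopiesAsGarbage       // renders correctly, extracts as nonsense
  );

// Maps each failure mode to the first thing worth checking.
function FirstCheck(Symptom: TSymptom): string;
begin
  case Symptom of
    syBoxesOnOtherMachine:
      Result := 'font embedding, then glyph coverage';
    syBoxesEverywhere:
      Result := 'glyph coverage of the chosen font';
    syCopiesAsGarbage:
      Result := 'ToUnicode CMap';
  else
    Result := 'unknown symptom';
  end;
end;

var
  S: TSymptom;
begin
  for S := Low(TSymptom) to High(TSymptom) do
    WriteLn(Ord(S), ': check ', FirstCheck(S));
end.
```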

None of this is exotic once the split between character, code, and glyph is clear. Embed the fonts you draw with, subset the large ones, reach for a Unicode face and RegisterUnicodeTTF the moment text leaves Latin, and keep a correct ToUnicode map if anyone will extract the text. Get those right and the boxes stop appearing. For the surrounding mechanics, the anatomy of a minimal PDF shows where the font dictionary sits in the object tree, and the document structure walkthrough covers how resources are shared across pages.

The SetFont, FontEmbedding, and RegisterUnicodeTTF calls shown here are part of the HotPDF Component for Delphi and C++Builder.