Solutions Appendix

Chapter 14

Tokenization

18 Solutions

Detailed solutions for the exercises in Chapter 14. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper

Character vs word tokenization: sequence length and attention cost for a 1000-char doc.

Solution

Character-level: ~1000 tokens. Word-level (English averages ~5 chars/word plus spaces): ~170 tokens. Since attention is O(T²), the cost ratio is (1000/170)² ≈ 35× more expensive for character-level. The tension: characters give a tiny vocabulary and no OOV but very long sequences; words give short sequences but huge vocabularies and OOV problems. Subword tokenization (BPE) is the compromise that captures most of both benefits.

Exercise 2Pen & Paper

Trace BPE on {'a b a b':3, 'a b c':2}; show the first three merges and vocabulary.

Solution

Initial symbols: a, b, c. Pair counts: 'a b' appears 3×2 + 2×1 = 8 times, 'b a' 3×1 = 3, 'b c' 2. Merge 1: (a,b)→'ab'. Now 'ab ab':3, 'ab c':2. Pair counts: ('ab','ab')=3, ('ab','c')=2. Merge 2: (ab,ab)→'abab' (count 3). Merge 3: (ab,c)→'abc' (count 2). Vocabulary grows to {a, b, c, ab, abab, abc}. BPE greedily merges the most frequent adjacent pair each step.

Exercise 3Pen & Paper

Why does byte-level BPE guarantee no OOV? Why base vocab 256?

Solution

Byte-level BPE starts from the 256 possible byte values (2⁸) — every conceivable file, in any language or script, is ultimately a sequence of bytes drawn from those 256. Since any input decomposes into known bytes, no token is ever truly out-of-vocabulary; in the worst case a rare word falls back to its individual bytes. The base is exactly 256 because a byte has 8 bits, covering all possible raw character encodings (UTF-8 included).

Exercise 4Pen & Paper

Show WordPiece's count(a,b)/(count(a)·count(b)) is proportional to PMI. Connect to Chapter 4.

Solution

Dividing by individual frequencies and taking logs gives log[P(a,b)/(P(a)P(b))] = PMI(a,b) (Chapter 4), up to corpus-size constants. So WordPiece merges the pair with the highest pointwise mutual information — the pair that co-occurs far more than chance would predict — rather than merely the most frequent pair (as raw BPE does). This favors merges that are genuinely informative, not just common.

Exercise 5Pen & Paper

Compare BPE (build up) and Unigram (prune down); scenario where segmentations differ.

Solution

BPE starts from characters and greedily merges frequent pairs (bottom-up); Unigram starts from a large candidate vocabulary and prunes tokens that least hurt the corpus likelihood (top-down, probabilistic). For a word like 'unhappiness', BPE might commit early to a frequent merge ('un'+'happiness') based on local frequency, while Unigram picks the segmentation maximizing overall likelihood, possibly 'un'+'happy'+'ness' — giving more morphologically clean splits in some cases. They can disagree because one is greedy/deterministic and the other is global/probabilistic.

Exercise 6Pen & Paper

Embedding parameters for V=50k vs 256k at d=768. Fraction of a 124M model?

Solution

V=50k: 50000×768 ≈ 38.4M parameters (≈31% of a 124M model). V=256k: 256000×768 ≈ 196.6M — larger than the entire 124M model. A bigger vocabulary shortens sequences and improves multilingual coverage but bloats the embedding/LM-head matrices, so vocabulary size is a real architectural trade-off, especially for smaller models.

Exercise 7Pen & Paper

Why does inconsistent number tokenization harm arithmetic? Propose a fix; name a model.

Solution

If '123', '1234', and '45' tokenize into irregular, inconsistent pieces, the model can't see digit place-value structure and must memorize arithmetic per token pattern — brittle and error-prone. A fix is to tokenize numbers digit-by-digit (or in fixed groups) so place value is explicit and consistent; models such as LLaMA (and others adopting per-digit number handling) do this, which measurably improves arithmetic.

Exercise 8Pen & Paper

Why does the same sentence cost more tokens in Burmese than English?

Solution

BPE merges are learned from the training corpus, which is English-dominated. So English words get efficient multi-character tokens, while under-represented scripts like Burmese rarely trigger merges and fall back to bytes/characters — many tokens per word. The chain: skewed corpus → merges favor frequent (English) patterns → rare-language text fragments into many small tokens → token inflation (and higher cost/worse context use) for those languages.

Exercise 9Pen & Paper

Explain glitch tokens. Why can a token exist yet have a near-random embedding?

Solution

Glitch tokens (e.g. ' SolidGoldMagikarp') are tokens the tokenizer created from training-corpus artifacts (scraped usernames, forum strings) that almost never appeared in the LM's actual training text. The tokenizer reserved an embedding slot, but the model rarely or never updated that embedding during training, so it stays near its random initialization — producing bizarre, unpredictable behavior when invoked. A token can exist in the vocabulary yet be semantically untrained.

Exercise 10Pen & Paper

Trace how vocabulary size V appears in the embedding, LM head, softmax cost, and sequence length.

Solution

V sets (a) the embedding matrix size V×d, (b) the LM-head size d×V (often tied to the embedding), (c) the output softmax cost O(V) per position, and (d) inversely, the sequence length — a larger vocabulary means fewer tokens per text. So V trades embedding/softmax cost against sequence length and coverage; it is 'load-bearing' because it touches memory, compute, multilingual fairness, and effective context simultaneously.

Exercise 11Code

Implement BPE training from scratch; print the first 20 merges.

Solution

Implement get_pair_counts (count adjacent symbol pairs across the corpus), merge_pair (replace the chosen pair with a new symbol everywhere), and loop, recording each merge. The first merges capture the most frequent pairs (common letter combos like 'th', 'in', 'er'), demonstrating the greedy bottom-up vocabulary construction of Exercise 2.

Exercise 12Code

Implement BPE encoding with learned merges; confirm unseen words still tokenize.

Solution

Encoding applies the learned merges in priority order to a new word's characters. Words seen in training collapse to few tokens; unseen words still tokenize — they just fall back to smaller subword pieces (or bytes) — demonstrating BPE's graceful handling of novel words without OOV failures.

Exercise 13Code

Use tiktoken to compare token counts across 5 languages; plot inflation vs English.

Solution

Tokenizing the same passage translated into several languages and dividing each token count by the English count shows inflation factors well above 1 for under-represented or non-Latin-script languages — the empirical version of Exercise 8, with implications for cost and effective context length across languages.

Exercise 14Code Lab

Compare BPE, WordPiece, Unigram segmentations on the same text; tabulate and find differences.

Solution

Running the three tokenizers (GPT-2 BPE, BERT WordPiece, T5 Unigram) on identical text yields different token counts and split points — e.g. WordPiece marks continuations with '##', Unigram tends toward morphologically cleaner pieces. The table makes concrete that 'tokens' are tokenizer-specific, affecting both counts and where word boundaries fall (Exercises 4–5).

Exercise 15Code

Demonstrate the number problem: tokens-per-number 0–1000 with GPT-2 vs per-digit.

Solution

Plotting tokens-per-number under the GPT-2 tokenizer shows an irregular, non-monotonic pattern (some numbers are single tokens, others split unpredictably), while a per-digit scheme gives a clean, regular count proportional to the number of digits — visually motivating the arithmetic fix of Exercise 7.

Exercise 16Code

Use apply_chat_template; show inserted special tokens; contrast a wrong hand-built template.

Solution

apply_chat_template inserts the model's exact role markers and special tokens (e.g. system/user/assistant delimiters). Building the prompt by hand with the wrong tokens or spacing produces a subtly different token sequence, which can noticeably degrade an instruction-tuned model's behavior — demonstrating that matching the training-time template precisely matters.

Exercise 17Code

Audit a tokenizer for glitch tokens via per-token embedding norms.

Solution

Computing the L2 norm of each row of the embedding matrix and listing the smallest-norm tokens surfaces candidates that were barely trained — often the glitch tokens of Exercise 9. Anomalously small norms indicate embeddings still near initialization, a practical detector for these failure-prone tokens.

Exercise 18Code (Challenge)

Train a byte-level BPE on a multilingual corpus; measure inflation; show it shrinks with larger vocab.

Solution

Training byte-level BPE on English + a non-Latin language and measuring the token-inflation factor between them quantifies the imbalance of Exercise 8. Retraining with a larger vocabulary lets more merges cover the second language, shrinking the inflation factor — demonstrating that vocabulary size and corpus balance jointly govern cross-lingual fairness.

←

ReturnAppendix Index

ReviewBack to Chapter 14

→