Tokenization
Detailed solutions for the exercises in Chapter 14. Try solving them yourself before checking the answers.
Solution
Character-level: ~1000 tokens. Word-level (English averages ~5 chars/word plus spaces): ~170 tokens. Since attention is O(T²), the cost ratio is (1000/170)² ≈ 35× more expensive for character-level. The tension: characters give a tiny vocabulary and no OOV but very long sequences; words give short sequences but huge vocabularies and OOV problems. Subword tokenization (BPE) is the compromise that captures most of both benefits.
Solution
Initial symbols: a, b, c. Pair counts: 'a b' appears 3×2 + 2×1 = 8 times, 'b a' 3×1 = 3, 'b c' 2. Merge 1: (a,b)→'ab'. Now 'ab ab':3, 'ab c':2. Pair counts: ('ab','ab')=3, ('ab','c')=2. Merge 2: (ab,ab)→'abab' (count 3). Merge 3: (ab,c)→'abc' (count 2). Vocabulary grows to {a, b, c, ab, abab, abc}. BPE greedily merges the most frequent adjacent pair each step.
Solution
Byte-level BPE starts from the 256 possible byte values (2⁸) — every conceivable file, in any language or script, is ultimately a sequence of bytes drawn from those 256. Since any input decomposes into known bytes, no token is ever truly out-of-vocabulary; in the worst case a rare word falls back to its individual bytes. The base is exactly 256 because a byte has 8 bits, covering all possible raw character encodings (UTF-8 included).
Solution
Dividing by individual frequencies and taking logs gives log[P(a,b)/(P(a)P(b))] = PMI(a,b) (Chapter 4), up to corpus-size constants. So WordPiece merges the pair with the highest pointwise mutual information — the pair that co-occurs far more than chance would predict — rather than merely the most frequent pair (as raw BPE does). This favors merges that are genuinely informative, not just common.
Solution
BPE starts from characters and greedily merges frequent pairs (bottom-up); Unigram starts from a large candidate vocabulary and prunes tokens that least hurt the corpus likelihood (top-down, probabilistic). For a word like 'unhappiness', BPE might commit early to a frequent merge ('un'+'happiness') based on local frequency, while Unigram picks the segmentation maximizing overall likelihood, possibly 'un'+'happy'+'ness' — giving more morphologically clean splits in some cases. They can disagree because one is greedy/deterministic and the other is global/probabilistic.
Solution
V=50k: 50000×768 ≈ 38.4M parameters (≈31% of a 124M model). V=256k: 256000×768 ≈ 196.6M — larger than the entire 124M model. A bigger vocabulary shortens sequences and improves multilingual coverage but bloats the embedding/LM-head matrices, so vocabulary size is a real architectural trade-off, especially for smaller models.
Solution
If '123', '1234', and '45' tokenize into irregular, inconsistent pieces, the model can't see digit place-value structure and must memorize arithmetic per token pattern — brittle and error-prone. A fix is to tokenize numbers digit-by-digit (or in fixed groups) so place value is explicit and consistent; models such as LLaMA (and others adopting per-digit number handling) do this, which measurably improves arithmetic.
Solution
BPE merges are learned from the training corpus, which is English-dominated. So English words get efficient multi-character tokens, while under-represented scripts like Burmese rarely trigger merges and fall back to bytes/characters — many tokens per word. The chain: skewed corpus → merges favor frequent (English) patterns → rare-language text fragments into many small tokens → token inflation (and higher cost/worse context use) for those languages.
Solution
Glitch tokens (e.g. ' SolidGoldMagikarp') are tokens the tokenizer created from training-corpus artifacts (scraped usernames, forum strings) that almost never appeared in the LM's actual training text. The tokenizer reserved an embedding slot, but the model rarely or never updated that embedding during training, so it stays near its random initialization — producing bizarre, unpredictable behavior when invoked. A token can exist in the vocabulary yet be semantically untrained.
Solution
V sets (a) the embedding matrix size V×d, (b) the LM-head size d×V (often tied to the embedding), (c) the output softmax cost O(V) per position, and (d) inversely, the sequence length — a larger vocabulary means fewer tokens per text. So V trades embedding/softmax cost against sequence length and coverage; it is 'load-bearing' because it touches memory, compute, multilingual fairness, and effective context simultaneously.
Solution
Implement get_pair_counts (count adjacent symbol pairs across the corpus), merge_pair (replace the chosen pair with a new symbol everywhere), and loop, recording each merge. The first merges capture the most frequent pairs (common letter combos like 'th', 'in', 'er'), demonstrating the greedy bottom-up vocabulary construction of Exercise 2.
Solution
Encoding applies the learned merges in priority order to a new word's characters. Words seen in training collapse to few tokens; unseen words still tokenize — they just fall back to smaller subword pieces (or bytes) — demonstrating BPE's graceful handling of novel words without OOV failures.
Solution
Tokenizing the same passage translated into several languages and dividing each token count by the English count shows inflation factors well above 1 for under-represented or non-Latin-script languages — the empirical version of Exercise 8, with implications for cost and effective context length across languages.
Solution
Running the three tokenizers (GPT-2 BPE, BERT WordPiece, T5 Unigram) on identical text yields different token counts and split points — e.g. WordPiece marks continuations with '##', Unigram tends toward morphologically cleaner pieces. The table makes concrete that 'tokens' are tokenizer-specific, affecting both counts and where word boundaries fall (Exercises 4–5).
Solution
Plotting tokens-per-number under the GPT-2 tokenizer shows an irregular, non-monotonic pattern (some numbers are single tokens, others split unpredictably), while a per-digit scheme gives a clean, regular count proportional to the number of digits — visually motivating the arithmetic fix of Exercise 7.
Solution
apply_chat_template inserts the model's exact role markers and special tokens (e.g. system/user/assistant delimiters). Building the prompt by hand with the wrong tokens or spacing produces a subtly different token sequence, which can noticeably degrade an instruction-tuned model's behavior — demonstrating that matching the training-time template precisely matters.
Solution
Computing the L2 norm of each row of the embedding matrix and listing the smallest-norm tokens surfaces candidates that were barely trained — often the glitch tokens of Exercise 9. Anomalously small norms indicate embeddings still near initialization, a practical detector for these failure-prone tokens.
Solution
Training byte-level BPE on English + a non-Latin language and measuring the token-inflation factor between them quantifies the imbalance of Exercise 8. Retraining with a larger vocabulary lets more merges cover the second language, shrinking the inflation factor — demonstrating that vocabulary size and corpus balance jointly govern cross-lingual fairness.