Part III: The Transformer
Chapter 14

Tokenization

From raw text to token IDs and back
18 Exercises
14.1

In Chapter 13 you built a Transformer that maps token IDs to predictions — but where do token IDs come from? Tokenization is the process that turns raw text into the sequence of integers a model consumes, and converts the model's integer outputs back into text. It is the unglamorous interface layer that every other part of the model depends on, and its design choices ripple through everything from arithmetic ability to multilingual fairness.

The Fundamental Tension

Tokenization must choose a unit of text. The two obvious choices both fail, and the failure of each motivates the subword compromise that all modern tokenizers use.

| Character-level tokens | Word-level tokens |
| --- | --- |
| Tiny vocabulary (~256 bytes) | Huge vocabulary (millions of words) |
| Never out-of-vocabulary | Out-of-vocabulary words break it |
| Sequences are very long | Sequences are short |
| Model must learn spelling from scratch | No sharing across word forms |
| Wastes context on trivial structure | 'run' and 'running' unrelated |
| Slow: O(T²) attention over long T | Cannot handle typos, new words |

Characters give a tiny vocabulary and never fail on unseen text, but produce very long sequences — and attention is quadratic in length. Words give short sequences but an unbounded vocabulary, and shatter on any unseen word, typo, or morphological variant. Subword tokenization splits the difference: common words stay whole, rare words decompose into meaningful pieces.
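
To make the length trade-off concrete, the sketch below counts units for one sentence under character-level and word-level splitting, then compares the quadratic attention cost. The example sentence is arbitrary, and plain whitespace splitting is only a stand-in for a word-level tokenizer.

```python
# Illustrative only: whitespace splitting stands in for a word-level tokenizer.
text = "Subword tokenization splits the difference between characters and words."

char_tokens = list(text)     # one token per character
word_tokens = text.split()   # one token per whitespace-separated word

T_char, T_word = len(char_tokens), len(word_tokens)
print(f"character-level length: {T_char}")
print(f"word-level length:      {T_word}")

# Self-attention cost grows as T^2, so the cost ratio is (T_char / T_word)^2.
cost_ratio = (T_char / T_word) ** 2
print(f"attention cost ratio (char vs word): about {cost_ratio:.0f}x")
```

A subword tokenizer would land between the two lengths, which is the compromise the paragraph above describes.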

Intuition: Subwords: The Goldilocks Unit
Subword tokenization gives common words their own token ('the', 'running') while breaking rare words into reusable pieces ('tokenization' → 'token' + 'ization'). The vocabulary stays bounded (typically 30k–256k), sequences stay reasonably short, and unseen words always decompose into known pieces — no out-of-vocabulary failures.
This is the same subword idea you met in fastText (Chapter 8), but applied differently: fastText summed n-gram vectors into one word vector, while BPE feeds each subword as a separate token to the Transformer, which composes them contextually.
14.2

Byte-Pair Encoding, originally a 1994 data-compression algorithm, was adapted for tokenization by Sennrich et al. (2016). It is the most widely used tokenization method, powering GPT-2, GPT-3, GPT-4, and many others. The idea is elegantly simple: start with individual characters and repeatedly merge the most frequent adjacent pair into a new token.

The BPE Training Algorithm

Pseudocode: BPE training (learn the merges)
# Start: every word is a sequence of characters
vocab ← all individual characters in the corpus
represent each word as a list of characters + end marker

repeat until vocab reaches target size:
    count all adjacent symbol pairs across the corpus
    pair ← most frequent adjacent pair
    merge pair into a new symbol; add to vocab
    record the merge rule (order matters!)

return vocab and ordered list of merge rules

A worked example on a tiny corpus shows the mechanism. Suppose the word 'low' appears 5 times and 'lower' twice. BPE counts adjacent pairs and finds that 'l'+'o' and 'o'+'w' each occur 7 times, more than any other pair. Taking the first of the tied pairs, it merges 'l'+'o' into 'lo', then 'lo'+'w' into 'low', and so on, building up common substrings into single tokens.

Python Code Lab: BPE training from scratch
import collections, re

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in word_freqs.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in word_freqs:
        new = pattern.sub(''.join(pair), word)
        merged[new] = word_freqs[word]
    return merged

def train_bpe(word_freqs, n_merges):
    """word_freqs: {space-separated chars: count}. Returns merge list."""
    merges = []
    for _ in range(n_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs: break
        best = max(pairs, key=pairs.get)       # most frequent pair
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
    return merges

# Tiny corpus: words split into chars with end-of-word marker </w>
corpus = {
    'l o w </w>': 5, 'l o w e r </w>': 2,
    'n e w e s t </w>': 6, 'w i d e s t </w>': 3,
}
merges = train_bpe(corpus, n_merges=10)
print(merges[:5])
# [('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')]
# 'est</w>' became one token because it recurs in 'newest' and 'widest'

Encoding New Text

Once trained, encoding applies the learned merges in order: split text into characters, then apply each merge rule in turn wherever its pair occurs. The order matters: merges learned earlier have higher priority. Decoding is trivial: concatenate the token strings and turn each end-of-word marker back into a word boundary.

Python: BPE encoding with learned merges
def encode_bpe(word, merges):
    """Apply learned merges in priority order to tokenize a word."""
    symbols = list(word) + ['</w>']
    for pair in merges:       # merges are in priority order
        i = 0
        while i < len(symbols)-1:
            if (symbols[i], symbols[i+1]) == pair:
                symbols[i:i+2] = [''.join(pair)]  # merge in place
            else:
                i += 1
    return symbols

print(encode_bpe('lowest', merges))
# ['low', 'est</w>']  -- 'low' and 'est' were learned as units
# Even though 'lowest' never appeared in training, it decomposes cleanly.
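
Decoding, described above as simple concatenation, can be sketched as follows. The function name `decode_bpe` is hypothetical, and the sketch assumes the `</w>` end-of-word convention used in this chapter's toy corpus.

```python
def decode_bpe(tokens):
    """Concatenate token strings, turning each end-of-word marker into a space."""
    text = ''.join(tokens)
    return text.replace('</w>', ' ').strip()

print(decode_bpe(['low', 'est</w>']))             # lowest
print(decode_bpe(['low</w>', 'new', 'est</w>']))  # low newest
```

Because every token is just a string, no merge table is needed at decode time. A production tokenizer would also have to handle input text that literally contains the marker string, which this sketch ignores.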
14.3

Character-level BPE has a hidden flaw: the set of possible characters is enormous (Unicode assigns roughly 150,000 characters) and open-ended (new emoji keep appearing). If a character was never seen in training, it cannot be tokenized. GPT-2 solved this elegantly: run BPE over raw bytes instead of characters.

The Byte-Level Guarantee

Every piece of text, in any language or script, is ultimately a sequence of bytes — and there are only 256 possible byte values. By running BPE on bytes, the base vocabulary is exactly 256 tokens, and any text whatsoever can be represented. There are no out-of-vocabulary tokens, ever, by construction.

Why byte-level is universal
Any text  →  UTF-8 bytes  →  sequence in {0, ..., 255}

Base vocab = 256 byte tokens  (always complete)
+ learned merges of frequent byte sequences
⇒ zero out-of-vocabulary tokens, any language, any emoji, any symbol
No OOV Tokens, Ever
Byte-level BPE makes the out-of-vocabulary problem disappear entirely. A never-before-seen emoji, a rare Chinese character, a binary blob — all decompose into bytes, and every byte is in the vocabulary. The model might tokenize them inefficiently (many bytes per character), but it can always represent them.
This is why GPT-2 and successors use byte-level BPE. The trade-off: non-English text, which uses multi-byte UTF-8 encodings, costs more tokens per character — a fairness issue we examine in Section 14.8.
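
The byte-level guarantee can be checked directly with Python's built-in UTF-8 encoder. The sample strings below are arbitrary illustrations. The sketch shows only the base vocabulary (one token per byte, before any merges), which is the worst case a byte-level tokenizer can fall back to.

```python
# Sample strings are arbitrary illustrations.
samples = {'English': 'hello', 'Russian': 'привет', 'Chinese': '你好', 'Emoji': '🙂'}

for name, s in samples.items():
    data = s.encode('utf-8')                 # the byte sequence byte-level BPE starts from
    ids = list(data)                         # base-vocabulary token IDs, each in 0..255
    assert all(0 <= i <= 255 for i in ids)   # nothing falls outside the 256-entry vocab
    assert bytes(ids).decode('utf-8') == s   # lossless round trip
    print(f"{name:8} chars={len(s)} bytes={len(ids)}")
# English  chars=5 bytes=5
# Russian  chars=6 bytes=12
# Chinese  chars=2 bytes=6
# Emoji    chars=1 bytes=4
```

The growing bytes-per-character ratio is the raw material of the token-cost gap: unless the learned merges cover a script well, its text stays close to this byte-level length.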
Python: Byte-level tokenization with tiktoken (GPT's tokenizer)
# pip install tiktoken
import tiktoken

# GPT-4's tokenizer
enc = tiktoken.get_encoding('cl100k_base')

text = 'Tokenization shapes everything.'
ids  = enc.encode(text)
print(ids)                        # [3404, 2065, 13745, 5238, 13]
print([enc.decode([i]) for i in ids])
# ['Token', 'ization', ' shapes', ' everything', '.']
# Note: ' shapes' includes the leading space -- spaces attach to words

# Token count varies wildly by content type
print(len(enc.encode('hello world')))        # 2 tokens
print(len(enc.encode('你好世界')))            # 6 tokens (Chinese costs more)
print(len(enc.encode('1234567890')))        # 4 tokens (digits split oddly)

# Rule of thumb for English: ~1 token ≈ 0.75 words ≈ 4 characters
14.4

BPE is the most common tokenizer, but two important alternatives exist. WordPiece (used by BERT) merges by likelihood rather than raw frequency. Unigram (used by many SentencePiece models) takes the opposite approach — starting with a large vocabulary and pruning it down probabilistically.

WordPiece: Merge by Likelihood

WordPiece is nearly identical to BPE but changes the merge criterion. Instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data — effectively the pair whose merged frequency most exceeds what you'd expect from its parts appearing independently.

WordPiece merge score
BPE:       merge argmax  count(a, b)
WordPiece: merge argmax  count(a,b) / (count(a) · count(b))

# WordPiece prefers pairs that co-occur more than chance would predict
# (up to a constant, the log of this ratio is pointwise mutual information, from Chapter 4)
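
A small numeric sketch shows how the two criteria can disagree. The symbol and pair counts below are made up purely for illustration, and the helper names `bpe_score` and `wordpiece_score` are hypothetical.

```python
from collections import Counter

# Hypothetical counts, chosen only to illustrate the two criteria.
symbol_counts = Counter({'t': 1000, 'h': 800, 'q': 20, 'u': 300})
pair_counts = Counter({('t', 'h'): 500, ('q', 'u'): 19})

def bpe_score(pair):
    """BPE: raw frequency of the adjacent pair."""
    return pair_counts[pair]

def wordpiece_score(pair):
    """WordPiece: pair frequency normalized by the frequencies of its parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

print(max(pair_counts, key=bpe_score))        # ('t', 'h')
print(max(pair_counts, key=wordpiece_score))  # ('q', 'u')
```

Here 't'+'h' is far more frequent, so BPE merges it first. But 'q' is almost always followed by 'u', so the normalized score favors 'q'+'u'.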

Unigram: Prune Down, Don't Build Up

Unigram (Kudo, 2018) inverts the process. It starts with a large candidate vocabulary, assigns each token a probability, and iteratively removes the tokens whose loss of likelihood is smallest — pruning until the target size is reached. At encoding time, it finds the most probable segmentation of the text under the unigram language model.
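
The encoding step, finding the most probable segmentation, is a dynamic-programming (Viterbi) search. The sketch below assumes a tiny hand-written table of token probabilities. A trained Unigram model would estimate these values, and they are not normalized here. The name `unigram_segment` is hypothetical.

```python
import math

# Hypothetical, unnormalized token probabilities for illustration only.
probs = {'token': 0.02, 'ization': 0.01, 'tok': 0.005, 'en': 0.03,
         'iz': 0.002, 'ation': 0.015}
probs.update({c: 0.001 for c in set('tokenization')})  # single-character fallbacks

def unigram_segment(text, probs):
    """Viterbi search for the segmentation with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of any segmentation of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last token on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start] > -math.inf:
                score = best[start] + math.log(probs[piece])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        raise ValueError('text cannot be segmented with this vocabulary')
    tokens, i = [], n
    while i > 0:                   # follow back-pointers from the end of the text
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

tokens, logp = unigram_segment('tokenization', probs)
print(tokens)   # ['token', 'ization']
```

Unlike BPE's fixed merge order, this search compares whole segmentations, which is why the two methods can split the same word differently.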

| Method | Direction | Merge/keep criterion | Used in |
| --- | --- | --- | --- |
| BPE | Build up | Most frequent pair | GPT-2/3/4, RoBERTa |
| WordPiece | Build up | Highest likelihood gain | BERT, DistilBERT, ELECTRA |
| Unigram | Prune down | Smallest likelihood loss | T5, ALBERT, mBART, XLNet |
| SentencePiece | Wrapper | Raw text → BPE or Unigram | Many multilingual models |

ML Connection: SentencePiece Is a Wrapper, Not an Algorithm
A common confusion: SentencePiece is often listed alongside BPE and Unigram as if it were a third algorithm. It is not — SentencePiece is a library (Kudo & Richardson, 2018) that implements BOTH BPE and Unigram, with a key innovation: it treats the input as a raw stream including spaces (encoding space as a special ▁ character), so it needs no language-specific pre-tokenization.
This makes SentencePiece truly language-agnostic — it works identically on English (space-separated) and Chinese/Japanese (no spaces). It is the standard choice for multilingual models like mBART and many of Google's models.
14.5

Vocabulary size is one of the most important tokenizer hyperparameters, and it trades off two competing costs. A larger vocabulary tokenizes text into fewer tokens (shorter sequences, cheaper attention) but requires a larger embedding matrix and softmax (more parameters, more compute in the output layer).

The Two Competing Costs

| Larger vocabulary | Smaller vocabulary |
| --- | --- |
| Fewer tokens per text (shorter sequences) | More tokens per text (longer sequences) |
| Cheaper attention (smaller T) | More expensive attention (larger T) |
| Larger embedding & softmax matrices | Smaller embedding & softmax matrices |
| Rare tokens get few training updates | Tokens are more frequent, better trained |
| More of each language fits in context | Context fills with subword fragments |

Typical Vocabulary Sizes

| Model | Vocab size | Tokenizer |
| --- | --- | --- |
| BERT | 30,522 | WordPiece |
| GPT-2 | 50,257 | Byte-level BPE |
| GPT-3 | 50,257 | Byte-level BPE |
| GPT-4 (cl100k) | ~100,277 | Byte-level BPE |
| LLaMA-2 | 32,000 | SentencePiece BPE |
| LLaMA-3 | 128,256 | Byte-level BPE (tiktoken) |
| Gemma | 256,000 | SentencePiece (large multilingual) |

The trend over time is toward larger vocabularies. LLaMA-3 quadrupled LLaMA-2's vocabulary (32k → 128k), and Gemma uses 256k. Larger vocabularies improve multilingual coverage and tokenization efficiency, and at large model scale the extra embedding parameters are a small fraction of the total.

Train Note: The Embedding Matrix Can Dominate Small Models
For a small model, the embedding and unembedding matrices can be a large fraction of all parameters. GPT-2 small has 124M parameters, of which the 50,257×768 embedding is 38M — over 30%. Weight tying (Chapter 13) helps by sharing input and output embeddings.
At large scale this inverts: for a 70B model, even a 128k vocab embedding (128,256×8192 ≈ 1B params) is under 2% of the total. This is why large models can afford big vocabularies while small models cannot.
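
The arithmetic in this note is easy to reproduce. The sketch below uses the vocabulary sizes, widths, and nominal parameter totals quoted above. `embedding_params` is a hypothetical helper, and it counts a single (tied) embedding matrix.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in one V x d embedding matrix."""
    return vocab_size * d_model

small = embedding_params(50_257, 768)       # GPT-2 small
print(f"{small:,} params, {small / 124e6:.1%} of 124M")
# 38,597,376 params, 31.1% of 124M

large = embedding_params(128_256, 8192)     # 128k vocabulary at width 8192
print(f"{large:,} params, {large / 70e9:.1%} of 70B")
# 1,050,673,152 params, 1.5% of 70B
```

Without weight tying the output matrix doubles both figures, which matters far more for the small model than for the large one.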
14.6

LLMs are famously unreliable at arithmetic, and a surprising amount of the blame falls on tokenization. The way numbers get split into tokens is often inconsistent and counterintuitive, making it hard for the model to learn the place-value structure that arithmetic requires.

The Number-Splitting Problem

Consider how GPT-2's tokenizer splits numbers: a short number such as '127' may be a single token, while '1234' splits as '12' + '34' and '12345' as '123' + '45', depending on which digit sequences happened to be frequent in the tokenizer's training data. The model sees no consistent representation of place value, so it cannot easily learn that the '1' in '127' means one hundred.

Python: Inconsistent number tokenization
import tiktoken
enc = tiktoken.get_encoding('gpt2')

# Numbers split inconsistently -- no place-value structure
for n in ['127', '128', '1234', '12345', '1000000']:
    ids = enc.encode(n)
    print(f"{n:>8}: {len(ids)} tokens  {[enc.decode([i]) for i in ids]}")
#      127: 1 tokens  ['127']
#      128: 1 tokens  ['128']
#     1234: 2 tokens  ['12', '34']
#    12345: 2 tokens  ['123', '45']
#  1000000: 3 tokens  ['1', '000', '000']
# No consistent digit grouping -- place value is scrambled.

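Modern tokenizers mitigate this. LLaMA splits every digit into its own token, giving a consistent per-digit representation. GPT-4's tokenizer groups digits left to right in chunks of up to three, so a number's token boundaries depend only on its length and not on corpus frequency (although '1234567' becomes '123' + '456' + '7', which does not line up with thousands separators). Consistent schemes like these are reported to improve arithmetic accuracy.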
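
The two digit schemes can be sketched as plain string splitting. This covers only the pre-splitting step, not a full tokenizer. The left-to-right grouping in `chunks_of_three` is an assumption about how the up-to-three-digit rule is applied, and both function names are hypothetical.

```python
import re

def per_digit(number: str):
    """One token per digit: boundaries never depend on corpus statistics."""
    return list(number)

def chunks_of_three(number: str):
    """Greedy left-to-right chunks of up to three digits (assumed grouping)."""
    return re.findall(r'\d{1,3}', number)

for n in ['127', '1234', '1000000']:
    print(n, per_digit(n), chunks_of_three(n))
# 127 ['1', '2', '7'] ['127']
# 1234 ['1', '2', '3', '4'] ['123', '4']
# 1000000 ['1', '0', '0', '0', '0', '0', '0'] ['100', '000', '0']
```

Both schemes are deterministic functions of the digit string, unlike the frequency-driven splits of GPT-2.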

⚠️ Pitfall: Tokenization Artifacts Are Everywhere
Number arithmetic is the clearest case, but tokenization quietly distorts many tasks: reversing a string is hard because the model sees tokens not characters; counting letters in a word fails because the word is one opaque token; rhyming is impaired because phonetic structure is hidden inside tokens.
Whenever an LLM fails at a task that seems trivially character-level, suspect the tokenizer. The model never sees characters — it sees token IDs, and the character structure is hidden inside them.
ML Connection: The 'How Many R's in Strawberry' Problem
The viral failure where LLMs miscount letters in 'strawberry' is largely a tokenization artifact. 'strawberry' is tokenized as a few subword chunks (e.g. 'str' + 'aw' + 'berry'), and the model never sees the individual letters — so counting them requires reconstructing spelling it was never given directly.
This is why character-level and byte-level reasoning remains a weak spot even for capable models, and why some researchers advocate tokenizer-free, byte-level architectures (Chapter 33 touches on these).
14.7

Beyond ordinary text tokens, every tokenizer reserves special tokens that carry structural meaning: marking the start and end of sequences, padding, masking, and — for chat models — delineating turns between user and assistant. These tokens are how the model knows where a document begins, where a user's message ends, and when to stop generating.

| Special token | Purpose |
| --- | --- |
| `<\|endoftext\|>` / `</s>` | Marks document or sequence boundaries |
| `<bos>` / `<s>` | Beginning of sequence |
| `[PAD]` | Padding to align batch lengths |
| `[MASK]` | Masked position for BERT-style training |
| `[CLS]` / `[SEP]` | Classification token / segment separator (BERT) |
| `<\|im_start\|>` / `<\|im_end\|>` | Chat turn boundaries (ChatML format) |
| `<\|system\|>` `<\|user\|>` `<\|assistant\|>` | Role markers in chat templates |
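
As one concrete use of a special token, the sketch below pads a batch of variable-length sequences and builds the matching attention mask. The pad ID and the token IDs are made up for illustration, since every tokenizer defines its own. `pad_batch` is a hypothetical helper.

```python
PAD_ID = 0   # hypothetical ID for [PAD]; real tokenizers define their own

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the batch maximum and mark real tokens with 1."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is what tells attention to ignore the padded positions. The pad token itself carries no meaning.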

Chat Templates

Instruction-tuned chat models wrap conversations in a specific template of special tokens. The model is trained to recognize these markers, so getting the template exactly right at inference is essential — a mismatched template can degrade quality dramatically. This is why you should use the tokenizer's built-in apply_chat_template rather than constructing prompts by hand.

Python: Chat templates in practice
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-chat-hf')

messages = [
    {'role': 'system', 'content': 'You are helpful.'},
    {'role': 'user', 'content': 'What is 2+2?'},
]

# Let the tokenizer apply the model's exact chat template
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <s>[INST] <<SYS>>
# You are helpful.
# <</SYS>>
#
# What is 2+2? [/INST]

# NEVER hand-build this -- the exact special tokens and whitespace matter.
# A wrong template can severely degrade a chat model's responses.
14.8

Tokenizers are trained on a corpus, and that corpus is overwhelmingly English. The consequence is a quiet but real unfairness: the same meaning costs far more tokens in some languages than others, which translates directly into higher latency, higher cost, and less effective context for speakers of under-represented languages.

The Token Inflation Problem

Because BPE merges are learned from frequency, English subwords dominate the vocabulary. Languages with different scripts (Thai, Burmese, Telugu) or that were under-represented in training fragment into many more tokens for the same content. A sentence that costs 10 tokens in English might cost 30–50 in a low-resource language.

| Language | Relative token cost | Why |
| --- | --- | --- |
| English | 1.0× (baseline) | Tokenizer trained mostly on English |
| Spanish, French | ~1.1–1.3× | Latin script, well-represented |
| Chinese, Japanese | ~1.5–2.5× | Multi-byte chars, fewer merges |
| Hindi, Arabic | ~2–3× | Different scripts, under-represented |
| Burmese, Telugu | ~5–10× | Rare scripts, near byte-level |

⚠️ Tokenization Is a Fairness Issue
Since API pricing is per-token and context windows are measured in tokens, a speaker of a high-inflation language pays several times more for the same task and fits several times less text in the context window. Ahia et al. (2023) and Petrov et al. (2023) documented inflation factors up to 15× for some languages.
This is not a minor technical detail — it is a structural inequity baked into the interface layer. Larger, more multilingual vocabularies (Gemma's 256k, LLaMA-3's 128k) partly address it, which is one reason vocabulary sizes are growing.
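
The practical effect of an inflation factor is simple arithmetic, sketched below. The context-window size, the price, and the specific factors are hypothetical values chosen for illustration, with the factors picked from within the ranges tabulated above. `relative_budget` is a hypothetical helper.

```python
def relative_budget(context_tokens, price_per_1k, inflation):
    """Return (English-equivalent tokens that fit, price per 1k English-equivalent tokens)."""
    return context_tokens / inflation, price_per_1k * inflation

# Hypothetical 8,192-token window and 1.00 currency unit per 1k tokens.
for lang, factor in [('English', 1.0), ('Hindi', 2.5), ('Burmese', 8.0)]:
    fits, price = relative_budget(8192, 1.00, factor)
    print(f"{lang:8} fits ~{fits:.0f} English-equivalent tokens, costs {price:.2f} per 1k")
```

The same factor both shrinks the usable context and multiplies the price, so the two penalties compound for a long task.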
ML Connection: Why Vocabularies Are Growing
The move from 32k (LLaMA-2) to 128k (LLaMA-3) to 256k (Gemma) vocabularies is driven substantially by multilingual fairness and efficiency. A larger vocabulary can dedicate tokens to non-English subwords, reducing the inflation factor for those languages.
The cost — a larger embedding matrix — is increasingly affordable at large model scale, making bigger, more equitable vocabularies the clear trend for frontier models.
14.9

One of the strangest tokenization phenomena is glitch tokens: tokens that exist in the vocabulary but were almost never seen during model training, leaving their embeddings essentially random. When prompted with such a token, the model behaves erratically — refusing to repeat it, hallucinating, or producing nonsense.

The SolidGoldMagikarp Phenomenon

The most famous example is 'SolidGoldMagikarp' — a Reddit username that became a single GPT-2/GPT-3 token because it appeared frequently in the tokenizer-training data (Reddit) but rarely in the model-training data. Its embedding was barely updated, so the model could not process it: asked to repeat 'SolidGoldMagikarp', GPT-3 would output unrelated words, evade, or break.

Intuition: Why Glitch Tokens Happen
The tokenizer and the model are trained on different (though overlapping) data. A token can earn a vocabulary slot because it was frequent in the TOKENIZER corpus, yet receive almost no gradient updates because it was rare in the MODEL training corpus. Its embedding stays near its random initialization — a vector the model never learned to interpret.
Feeding such a token is like asking the model about a word in a language it never learned: the input vector points in a direction the rest of the network has no sensible response to, producing the characteristic glitchy behaviour.
| Pathology | Cause |
| --- | --- |
| Glitch tokens | Token frequent in tokenizer data, rare in training data |
| Trailing-space sensitivity | ' the' and 'the' are different tokens; prompts can mismatch |
| Tokenization of own output | Model output re-tokenized differently than intended |
| Prompt-boundary effects | Where a word splits depends on surrounding context |
| Repeated-token degeneration | Some token sequences trigger repetitive loops |

Train Note: Auditing for Glitch Tokens
Modern training pipelines audit for under-trained tokens by checking embedding norms: glitch tokens often have anomalously small embedding norms because they received few updates. Some teams remove or down-weight such tokens before release.
The lesson generalizes: the tokenizer and model must be trained on consistent data, and the vocabulary should be validated against the actual training distribution — not just the tokenizer-training corpus.
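
A norm-based audit of the kind described here can be sketched with the standard library alone. The embedding matrix below is synthetic: most rows are drawn at a "trained" scale and three hypothetical rows are left near a small initialization. A real audit would load the model's actual embedding weights. The z-score threshold of -4 is an arbitrary choice.

```python
import math
import random
import statistics

random.seed(0)
V, d = 1000, 32
glitch_ids = {13, 420, 777}   # hypothetical under-trained token IDs

# Synthetic stand-in for a trained (V, d) embedding matrix.
E = [[random.gauss(0, 0.02 if i in glitch_ids else 1.0) for _ in range(d)]
     for i in range(V)]

norms = [math.sqrt(sum(x * x for x in row)) for row in E]
mean, std = statistics.mean(norms), statistics.pstdev(norms)

# Flag tokens whose embedding norm lies far below the typical norm.
suspects = [i for i, n in enumerate(norms) if (n - mean) / std < -4.0]
print(suspects)
```

Flagged IDs are only candidates. They would still need to be checked by prompting the model with the corresponding tokens.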
14.10

Tokenization is the first and last step of every interaction with a language model. It is worth seeing the complete round trip — from raw text to token IDs to embeddings, through the model, and back to text — to appreciate how the tokenizer interfaces with everything you built in Chapters 12 and 13.

Arch Stack: The full text round-trip

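| Stage (bottom = input, top = output) | Shape or operation |
| --- | --- |
| output text | decode |
| token IDs (argmax/sample) | (T,) |
| logits over vocabulary | (T, V) |
| Transformer (Ch. 13) | the model |
| embeddings | (T, d) |
| token IDs | (T,) |
| input text | encode (BPE) |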

Shape Trace: Encode → model → decode (T tokens, vocab V, dim d)

| Operation | Shape | Note |
| --- | --- | --- |
| raw text | string | human input |
| BPE encode | (T,) | integer token IDs |
| embedding lookup | (T, d) | E[token_ids] |
| Transformer | (T, d) | contextual representations |
| LM head | (T, V) | logits over vocabulary |
| sample / argmax | (T,) | predicted token IDs |
| BPE decode | string | human-readable output |

Tokenization Connects to Everything
The vocabulary size V determines the embedding matrix (V×d) and the LM head (d×V) of Chapter 13. The tokenizer's segmentation determines the sequence length T, which drives the O(T²) attention cost of Chapter 12. The per-token cross-entropy loss of Chapter 4 is computed over exactly these tokens.
Tokenization is not a preprocessing afterthought — it is a load-bearing design decision that propagates through the entire model. A tokenizer change requires retraining from scratch, which is why it is fixed early and rarely revisited.
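
The shape trace can be walked through with a toy example. Everything in the sketch below is a stand-in: a five-entry vocabulary, random embeddings, an identity function in place of the Transformer, and a tied LM head. Plain Python lists are used to avoid dependencies. Only the shapes are meaningful, not the predictions.

```python
import random

random.seed(0)
vocab = ['<unk>', 'low', 'est', 'new', 'wid']   # toy vocabulary, V = 5
tok2id = {t: i for i, t in enumerate(vocab)}
V, d = len(vocab), 4

E = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]   # embeddings (V, d)

token_ids = [tok2id[t] for t in ['new', 'est']]   # encode: (T,)
x = [E[i] for i in token_ids]                     # embedding lookup: (T, d)
h = x                                             # stand-in for the Transformer: (T, d)
# Tied LM head: logits[t][v] = h[t] . E[v], giving shape (T, V)
logits = [[sum(a * b for a, b in zip(row, E[v])) for v in range(V)] for row in h]
pred_ids = [max(range(V), key=lambda v: l[v]) for l in logits]   # argmax: (T,)
out_tokens = [vocab[i] for i in pred_ids]         # decode back to strings

print(len(token_ids), (len(x), len(x[0])), (len(logits), len(logits[0])), len(pred_ids))
# 2 (2, 4) (2, 5) 2
```

Changing the vocabulary changes V in both `E` and `logits`, and changing the segmentation changes T everywhere else.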
14.11

Tokenizer Quick-Reference

| Method | Builds by | Used in | Note |
| --- | --- | --- | --- |
| BPE | Merging frequent pairs | GPT family | Most common |
| Byte-level BPE | BPE over 256 bytes | GPT-2+, LLaMA-3 | No OOV ever |
| WordPiece | Merging by likelihood | BERT | ## prefix for subwords |
| Unigram | Pruning a large vocab | T5, ALBERT | Probabilistic segmentation |
| SentencePiece | Wraps BPE/Unigram | Multilingual | Language-agnostic |

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

Exercise 1: Pen & Paper
Explain the tension between character-level and word-level tokenization. For a 1000-character document, estimate the sequence length under each and the attention-cost ratio.
Exercise 2: Pen & Paper
Trace BPE training by hand on the corpus {'a b a b': 3, 'a b c': 2}. Show the first three merges and the resulting vocabulary.
Exercise 3: Pen & Paper
Why does byte-level BPE guarantee no out-of-vocabulary tokens? What is the base vocabulary size, and why exactly 256?
Exercise 4: Pen & Paper
Show that the logarithm of WordPiece's merge criterion count(a,b)/(count(a)·count(b)) equals the pointwise mutual information of a and b up to an additive constant. Connect this to Chapter 4.
Exercise 5: Pen & Paper
Compare BPE (build up) and Unigram (prune down). Give a scenario where each might produce a different segmentation of the same word.
Exercise 6: Pen & Paper
For vocab V=50k, d=768, compute the embedding parameter count. Repeat for V=256k. What fraction of a 124M-param model is each?
Exercise 7: Pen & Paper
Explain why inconsistent number tokenization harms arithmetic. Propose a tokenization scheme for digits that would help, and name a model that uses it.
Exercise 8: Pen & Paper
Why does the same sentence cost more tokens in Burmese than English? Explain the chain from training corpus to merge rules to token inflation.
Exercise 9: Pen & Paper
Explain the glitch-token phenomenon. Why can a token exist in the vocabulary yet have an essentially random embedding?
Exercise 10: Pen & Paper
Trace how the vocabulary size V appears in (a) the embedding matrix, (b) the LM head, (c) the softmax cost, (d) the sequence length. Why is V a load-bearing choice?
Exercise 11: Code
Implement BPE training from scratch (get_pair_counts, merge_pair, train_bpe). Train on a paragraph of text and print the first 20 learned merges.
Exercise 12: Code
Implement BPE encoding using your learned merges. Encode words that did and did not appear in training; confirm unseen words still tokenize.
Exercise 13: Code
Use tiktoken to compare token counts for the same passage translated into 5 languages. Compute and plot the token-inflation factor relative to English.
Exercise 14: Code Lab
Compare BPE, WordPiece (BERT), and Unigram (T5) on the same text using Hugging Face tokenizers. Tabulate token counts and show where the segmentations differ.
Exercise 15: Code
Demonstrate the number-tokenization problem: tokenize 0–1000 with the GPT-2 tokenizer and plot tokens-per-number. Compare with a per-digit scheme.
Exercise 16: Code
Use apply_chat_template on an instruction-tuned model. Show the exact special tokens it inserts, then demonstrate how a hand-built (wrong) template differs.
Exercise 17: Code
Audit a tokenizer for potential glitch tokens: load a model's embedding matrix, compute per-token embedding norms, and list the tokens with anomalously small norms.
Exercise 18: Code (Challenge)
Train a complete byte-level BPE tokenizer on a small multilingual corpus (English + one non-Latin-script language). Measure the token-inflation factor between the two languages, then retrain with a larger vocabulary and show the factor shrink.

Further reading: “Neural Machine Translation of Rare Words with Subword Units” (Sennrich et al., 2016) — BPE for NLP. “Subword Regularization” and “SentencePiece” (Kudo, 2018; Kudo & Richardson, 2018) for Unigram and the library. “Language Model Tokenizers Introduce Unfairness Between Languages” (Petrov et al., 2023). Andrej Karpathy's “Let's build the GPT Tokenizer” video and his minbpe repository — the best hands-on companion. The tiktoken and Hugging Face tokenizers documentation for production use.


Next → Chapter 15: Training Transformers

You now have a complete Transformer (Chapter 13) and know how to feed it text (this chapter). Chapter 15 covers how to actually train one: the next-token prediction objective at scale, learning-rate schedules with warmup and decay, the AdamW optimizer in practice, gradient accumulation and clipping, mixed-precision training, and the practical recipe for a stable training run. We bring together the optimization of Chapter 2, the numerical stability of Chapter 5, and the architecture of Chapter 13 into a working training pipeline.
