The Transformer Architecture
Detailed solutions for the exercises in Chapter 13. Try solving them yourself before checking the answers.
Solution
If the input rows are permuted by a permutation matrix P, every query, key, and value is permuted identically, so the score matrix becomes P(scores)Pᵀ and the output is P(output) — the outputs are permuted the same way as the inputs, with values unchanged. So attention treats the input as a SET, blind to order. Since language is sequential ('dog bites man' ≠ 'man bites dog'), we must inject order information via positional encodings.
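The equivariance argument above can be checked numerically. The following is a minimal sketch in plain Python, assuming a single unmasked attention head with no positional information; the toy input, the weight matrices, and the helper names (`matmul`, `softmax`, `self_attention`) are illustrative choices, not part of the exercise.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head self-attention with no mask and no positional encoding."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

# Toy input: 3 tokens, model width 2 (arbitrary illustrative values).
x = [[1.0, 0.0], [0.5, -1.0], [-0.3, 2.0]]
wq = [[0.2, -0.1], [0.4, 0.3]]
wk = [[-0.5, 0.1], [0.2, 0.7]]
wv = [[1.0, 0.5], [-0.2, 0.3]]

perm = [2, 0, 1]  # new row i is old row perm[i]
out = self_attention(x, wq, wk, wv)
out_of_permuted = self_attention([x[i] for i in perm], wq, wk, wv)
permuted_out = [out[i] for i in perm]

# Attention(P x) should equal P Attention(x) up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for r1, r2 in zip(out_of_permuted, permuted_out)
    for a, b in zip(r1, r2)
)
print(max_diff)
```

The printed difference should be at rounding-error level, which is the numerical counterpart of "outputs are permuted the same way as the inputs, with values unchanged".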
Solution
Each sinusoidal dimension pair is (sin(ωpos), cos(ωpos)). By the angle-addition formulas, sin(ω(pos+k)) = sin(ωpos)cos(ωk) + cos(ωpos)sin(ωk) and similarly for cosine. So PE(pos+k) = R_k·PE(pos), where R_k is a fixed rotation matrix depending only on the offset k. Because shifting position by k is a linear (rotation) map independent of pos, the model can learn to attend by relative offset — the key property that makes sinusoidal encodings generalize across positions.
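The rotation identity can be verified directly for one dimension pair. This sketch assumes a single frequency ω (the value below is an arbitrary illustrative choice) and uses hypothetical helper names `pe_pair` and `shift`; the matrix R_k is written out from the angle-addition formulas.

```python
import math

def pe_pair(pos, omega):
    """One sinusoidal dimension pair (sin(omega*pos), cos(omega*pos))."""
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift(pair, k, omega):
    """Apply R_k to a (sin, cos) pair; R_k depends on k and omega, not on pos."""
    s, c = pair
    ck, sk = math.cos(omega * k), math.sin(omega * k)
    return (s * ck + c * sk, c * ck - s * sk)

omega = 0.1  # illustrative frequency for one dimension pair
k = 5        # fixed offset

errors = []
for pos in (0, 3, 17, 100):
    shifted = shift(pe_pair(pos, omega), k, omega)
    direct = pe_pair(pos + k, omega)
    errors.append(max(abs(a - b) for a, b in zip(shifted, direct)))

print(max(errors))
```

The same `shift` function, with the same k, maps PE(pos) to PE(pos+k) at every tested position, which is what "independent of pos" means here.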
Solution
Attention: four d×d projections (Q,K,V,O) = 4·512² ≈ 1.05M. FFN: two matrices 512×2048 and 2048×512 = 2·512·2048 ≈ 2.10M. Plus small LayerNorm params. The FFN holds about ⅔ of the block's parameters (≈2.1M of ≈3.15M) — which is why MoE (Chapter 32) targets the FFN for expansion.
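The arithmetic above can be reproduced in a few lines. This is a sketch that ignores biases and LayerNorm parameters, as the solution does; the variable names are illustrative.

```python
d_model = 512
d_ff = 2048

attn_params = 4 * d_model * d_model      # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff          # up-projection and down-projection
block_params = attn_params + ffn_params  # biases and LayerNorm ignored
ffn_share = ffn_params / block_params

print(attn_params, ffn_params, block_params, round(ffn_share, 3))
# prints: 1048576 2097152 3145728 0.667
```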
Solution
The residual stream is a running sum that each sublayer reads from (via LayerNorm) and writes an increment back to: x ← x + Sublayer(LN(x)). Pre-LN normalizes the INPUT to each sublayer but leaves the residual stream itself un-normalized, so gradients flow through the identity path undistorted — stable even for very deep models. Post-LN (x ← LN(x + Sublayer(x))) normalizes the stream itself every layer, which can shrink/distort gradients and often requires careful warmup. Pre-LN's clean residual highway is why it dominates modern LLMs.
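The two update rules differ only in where the normalization sits, which a toy example makes concrete. The sketch below assumes a LayerNorm without learned scale and shift and uses a hypothetical `sublayer` stand-in (an arbitrary affine map) in place of attention or the FFN; it illustrates the stream statistics only, not gradient behavior.

```python
import math

def mean(x):
    return sum(x) / len(x)

def variance(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift (an assumption for brevity)."""
    m, var = mean(x), variance(x)
    return [(v - m) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or the FFN."""
    return [0.5 * v + 1.0 for v in x]

def pre_ln_step(x):
    # x <- x + Sublayer(LN(x)): the stream itself is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_step(x):
    # x <- LN(x + Sublayer(x)): the stream is re-normalized every layer.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

pre = post = [1.0, 2.0, 3.0, 4.0]
for _ in range(6):
    pre = pre_ln_step(pre)
    post = post_ln_step(post)

print(mean(pre), variance(pre))    # running sum: statistics drift layer by layer
print(mean(post), variance(post))  # reset to roughly zero mean, unit variance
```

In the Pre-LN stream the original input is still present as an additive term after six layers; in the Post-LN stream it has been rescaled at every step.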
Solution
Embeddings: V×d = 50257×768 ≈ 38.6M (tied for input and output). Each layer ≈ 12d² ≈ 7.1M (attention 4d² + FFN 8d²), times 12 ≈ 85M. Total ≈ 124M (this is GPT-2 small). Most parameters live in the 12 Transformer layers (≈85M), with the embedding/LM-head matrix the largest single block (≈39M).
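The count can be reproduced as follows. This sketch uses the same 12d² per-layer approximation as the solution and ignores biases, LayerNorm parameters, and the position-embedding table, so it lands slightly under the rounded 124M figure; the variable names are illustrative.

```python
vocab_size = 50257
d_model = 768
n_layers = 12

embedding = vocab_size * d_model   # shared by input embedding and LM head
per_layer = 12 * d_model ** 2      # attention 4d^2 + FFN 8d^2
layers = n_layers * per_layer
total_tied = embedding + layers
total_untied = total_tied + embedding  # separate LM head adds one more V x d matrix

print(embedding, per_layer, layers, total_tied, total_untied)
# prints: 38597376 7077888 84934656 123532032 162129408
```

The difference between the last two numbers is exactly one V×d matrix, the saving that weight tying provides.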
Solution
Weight tying shares the input embedding matrix (V×d) with the output projection (the LM head), saving one V×d matrix = 50257×768 ≈ 38.6M parameters. Conceptually justified because both map between token identity and the d-dimensional space — the same semantic geometry should govern reading a token in and scoring it out — and empirically tying improves perplexity while shrinking the model.
Solution
Encoder-only (BERT): bidirectional (no mask), masked-LM objective, good for understanding/classification. Decoder-only (GPT): causal mask, next-token objective, good for generation. Encoder-decoder (T5): bidirectional encoder + causal decoder with cross-attention, good for seq2seq. Decoder-only won for LLMs because next-token prediction is a simple, universal objective that scales, enables open-ended generation, supports in-context learning, and unifies all tasks as text continuation — one architecture for everything.
Solution
RoPE rotates each 2-D sub-vector of q at position m by angle mθ and of k at position n by angle nθ. The dot product of two rotated 2-D vectors depends on the difference of their rotation angles: (R_{mθ}q)·(R_{nθ}k) = qᵀR_{mθ}ᵀR_{nθ}k = qᵀR_{(n−m)θ}k, a function of (m−n) only. Summing over sub-vectors, the attention score between positions m and n depends only on their relative offset m−n, which gives relative-position awareness with no added parameters and (Chapter 33) extrapolation potential.
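The relative-offset property is easy to check on a single 2-D sub-vector. The sketch below assumes one rotation frequency θ and arbitrary illustrative vectors; `rotate` and `rope_score` are hypothetical helper names.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, m, n, theta):
    """Dot product of q rotated by m*theta and k rotated by n*theta."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1

same_offset_near = rope_score(q, k, 7, 4, theta)      # m - n = 3
same_offset_far = rope_score(q, k, 107, 104, theta)   # m - n = 3, shifted by 100
other_offset = rope_score(q, k, 7, 5, theta)          # m - n = 2

print(same_offset_near, same_offset_far, other_offset)
```

The first two scores agree up to rounding even though the absolute positions differ by 100, while changing the offset changes the score.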
Solution
With causal attention, token t attends only to tokens ≤ t, so the keys and values of past tokens are FIXED: adding a new token never changes them. This lets you cache and reuse them, computing only the new token's K/V each step. Without the causal mask (bidirectional attention), every earlier position could attend to the new token, so its hidden state, and therefore its K/V in every layer above the first, would change; the cached K/V would be stale and the cache invalid.
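The caching argument can be illustrated with a single attention layer. This is a minimal sketch, assuming one head, toy weights, and token vectors given directly as inputs; `project`, `attend`, `recompute_step`, and `KVCache` are hypothetical names introduced for the example.

```python
import math

def project(x, w):
    """Row vector x times matrix w (given as a list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

def recompute_step(prefix, wq, wk, wv):
    """Output for the last token, recomputing K/V for the whole prefix."""
    keys = [project(x, wk) for x in prefix]
    values = [project(x, wv) for x in prefix]
    return attend(project(prefix[-1], wq), keys, values)

class KVCache:
    """Stores past keys/values; each step projects only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, wq, wk, wv):
        self.keys.append(project(x, wk))
        self.values.append(project(x, wv))
        return attend(project(x, wq), self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, -1.0], [-0.3, 2.0], [0.7, 0.7]]
wq = [[0.2, -0.1], [0.4, 0.3]]
wk = [[-0.5, 0.1], [0.2, 0.7]]
wv = [[1.0, 0.5], [-0.2, 0.3]]

cache = KVCache()
max_diff = 0.0
for t, x in enumerate(tokens):
    cached = cache.step(x, wq, wk, wv)
    full = recompute_step(tokens[: t + 1], wq, wk, wv)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))

print(max_diff)
```

Because each query only ever looks at positions up to its own, the cached keys and values are the same ones a full recomputation would produce.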
Solution
(1) Broken/absent causal mask — the model 'cheats' during training (low teacher-forced loss) but can't generate; detect by checking generation degrades while train loss is great, and by inspecting the mask. (2) Train/inference position-encoding mismatch — verify positions are applied identically in both paths. (3) KV-cache bug — cached generation diverges from non-cached; detect by comparing cached vs full-recompute outputs (they must be identical). Distinguishing them: compare teacher-forced vs free-running outputs, and unit-test the mask, positions, and cache separately.
Solution
The heatmap (position × dimension) shows low dimensions oscillating rapidly with position and high dimensions varying slowly — a spectrum of frequencies. This multi-scale structure is what lets the model read both fine and coarse positional information, and underlies the relative-position property of Exercise 2.
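The frequency spectrum behind the heatmap can be listed explicitly. The sketch below assumes the common schedule ω_i = 1/10000^(2i/d), which the solution does not spell out, and a small illustrative width d = 8.

```python
import math

d_model = 8  # illustrative width; real models use far more dimensions

# One frequency per (sin, cos) pair, under the assumed 10000^(2i/d) schedule.
omegas = [1.0 / 10000 ** (2 * i / d_model) for i in range(d_model // 2)]
wavelengths = [2 * math.pi / w for w in omegas]

for i, (w, lam) in enumerate(zip(omegas, wavelengths)):
    print(f"pair {i}: omega={w:.6f}, wavelength={lam:.1f} positions")
```

The first pair completes a cycle in about 2π positions, while later pairs take progressively longer, which matches the fast-to-slow pattern across the heatmap's dimension axis.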
Solution
Learned absolute encodings fail beyond the trained length (no embedding exists for unseen positions). Sinusoidal generalizes somewhat. RoPE extrapolates best and, with the scaling tricks of Chapter 33, extends furthest — demonstrating why modern models adopt rotary embeddings.
Solution
Wiring LN→MHA→residual→LN→FFN→residual and confirming that a (B,T,d) tensor exits with the same shape for any T validates the block, the reusable unit stacked to build the full model.
Solution
Embedding + positional encoding + N Pre-LN blocks + final LayerNorm + LM head produces (T,V) logits for any token sequence. Verifying the shape flow end-to-end confirms the full decoder-only architecture is correctly assembled.
Solution
Sharing the embedding matrix with the LM head (using its transpose for the output projection) reduces parameters by V×d ≈ 38.6M and the forward pass still yields valid logits — confirming the saving of Exercise 6 with no loss of function.
Solution
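Greedy decoding is deterministic and always picks the argmax token, so on an untrained (near-uniform) model it returns the same token for a given context on every run; temperature rescales the logits (T > 1 flattens the distribution, T < 1 sharpens it); top-k restricts sampling to the k most likely tokens; nucleus (top-p) keeps the smallest set of tokens whose cumulative probability reaches p. Sampling many continuations and inspecting the token distributions confirms each decoder behaves as designed (Chapter 3).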
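The four decoding rules can be sketched on a toy logit vector. This is an illustrative plain-Python version with hypothetical function names, operating on a list of logits rather than on a real model's output; it returns the filtered distributions and leaves the final random draw out.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the most likely token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]          # arbitrary illustrative values
probs = softmax_with_temperature(logits)
hot = softmax_with_temperature(logits, temperature=2.0)

print(greedy(probs))
print(top_k_filter(probs, 2))
print(nucleus_filter(probs, 0.9))
print(probs[0], hot[0])  # the top token loses mass at the higher temperature
```

To draw a token from any of the filtered distributions, `random.choices(range(len(dist)), weights=dist)` from the standard library is one option.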
Solution
Cached generation avoids recomputing attention over the whole prefix each step, giving a substantial speedup that grows with length, while producing token-for-token identical output to naive recomputation — validating the optimization of Exercise 12.18 in the full model.
Solution
Dropping the mask lets every position attend to every other; inspecting the attention matrix shows it is now fully dense (not triangular). This single-line change turns a GPT-style decoder into a BERT-style encoder — illustrating how masking alone distinguishes the two.
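The effect of the mask on the attention matrix can be seen on a tiny example. The sketch below assumes all raw scores are equal (as in a freshly initialized model, roughly) and uses a hypothetical `attention_weights` helper with a `causal` flag standing in for the single-line change.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax; with causal=True, position t sees only j <= t."""
    rows = []
    for t, row in enumerate(scores):
        masked = [s if (not causal or j <= t) else -math.inf
                  for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

T = 4
scores = [[0.0] * T for _ in range(T)]  # equal scores, purely illustrative

causal_w = attention_weights(scores, causal=True)
dense_w = attention_weights(scores, causal=False)

for row in causal_w:
    print(row)  # lower-triangular: zeros above the diagonal
for row in dense_w:
    print(row)  # fully dense: every position attends to every other
```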
Solution
Carefully mapping Hugging Face's parameter names/shapes into your implementation and running a fixed input should reproduce the reference logits to ~1e−4. Passing this gold-standard test proves your architecture is faithful down to every detail (layer order, LayerNorm placement, weight tying, scaling).
Solution
Implementing the complete model with backward (your autograd or PyTorch), training on a small corpus, and producing coherent (if small-scale) samples is the Part III capstone — a Transformer you built, trained, and sampled from end to end, embodying every concept from attention through the full architecture.