Solutions Appendix
Chapter 13

The Transformer Architecture

20 Solutions

Detailed solutions for the exercises in Chapter 13. Try solving them yourself before checking the answers.

Exercise 1: Pen & Paper
Prove self-attention without positional encoding is permutation-equivariant. Why does this require positional encoding?

Solution

Let X be the input and P a permutation matrix. Because Q, K, and V are row-wise linear maps of X, permuting the rows of X permutes them identically: Q → PQ, K → PK, V → PV. The score matrix becomes (PQ)(PK)ᵀ = P(QKᵀ)Pᵀ, and since softmax acts row by row, softmax(P S Pᵀ) = P softmax(S) Pᵀ. The output is therefore P softmax(S) PᵀPV = P(output), using PᵀP = I: the outputs are permuted the same way as the inputs, with values unchanged. So attention treats the input as a SET, blind to order. Since language is sequential ('dog bites man' ≠ 'man bites dog'), we must inject order information via positional encodings.
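The argument can also be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and NumPy; the helper names (`self_attention`, `softmax`) and the sizes are illustrative choices, not part of the exercise.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head attention with no mask and no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8                                  # illustrative sizes
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(T)
P = np.eye(T)[perm]                          # (P @ X)[i] == X[perm[i]]

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(P @ X, Wq, Wk, Wv)

# Equivariance: permuting the inputs permutes the outputs the same way.
equivariant = np.allclose(out_perm, P @ out)
print(equivariant)
```

Adding any position-dependent term to `X` before the projections breaks this equality, which is exactly the role of a positional encoding.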

Exercise 2: Derive
Show sinusoidal encoding lets the model attend to relative positions: PE(pos+k) as a linear function of PE(pos).

Solution

Each sinusoidal dimension pair is (sin(ωpos), cos(ωpos)). By the angle-addition formulas, sin(ω(pos+k)) = sin(ωpos)cos(ωk) + cos(ωpos)sin(ωk) and cos(ω(pos+k)) = cos(ωpos)cos(ωk) − sin(ωpos)sin(ωk). Both are linear in (sin(ωpos), cos(ωpos)) with coefficients that depend only on k, so for each pair PE(pos+k) = R_k·PE(pos), where R_k is a fixed 2×2 rotation matrix depending only on the offset k; stacking the pairs gives a block-diagonal R_k for the full vector. Because shifting position by k is a linear (rotation) map independent of pos, the model can learn to attend by relative offset, which is the key property that helps sinusoidal encodings generalize across positions.
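The identity is easy to confirm for one dimension pair. This is a small sketch under assumed values: the frequency `omega` is taken from a hypothetical pair index 3 in a 64-dimensional encoding, and `pe_pair` and `shift_matrix` are names introduced here for illustration.

```python
import numpy as np

# Frequency of one (sin, cos) pair; index 3 and d_model 64 are arbitrary choices.
omega = 1.0 / 10000 ** (2 * 3 / 64)

def pe_pair(pos):
    # The (sin, cos) pair of the encoding at a given position.
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def shift_matrix(k):
    # 2x2 matrix read off from the angle-addition formulas; depends only on k.
    c, s = np.cos(omega * k), np.sin(omega * k)
    return np.array([[c, s], [-s, c]])

# PE(pos + k) == R_k @ PE(pos) for every pos, with R_k independent of pos.
holds = all(
    np.allclose(shift_matrix(k) @ pe_pair(pos), pe_pair(pos + k))
    for pos in range(50)
    for k in (1, 7, 20)
)
print(holds)
```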

Exercise 3: Pen & Paper
Count parameters in a Transformer block (d=512, H=8, FFN=2048). FFN fraction?

Solution

Attention: four d×d projections (Q,K,V,O) = 4·512² ≈ 1.05M. FFN: two matrices 512×2048 and 2048×512 = 2·512·2048 ≈ 2.10M. Plus small LayerNorm params. The FFN holds about ⅔ of the block's parameters (≈2.1M of ≈3.15M) — which is why MoE (Chapter 32) targets the FFN for expansion.
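The arithmetic above can be reproduced in a few lines. This sketch assumes bias terms are ignored and that the block has two LayerNorms with a gain and a bias vector each; note that the head count H = 8 does not change the total, since the heads only partition the d×d projections.

```python
d, d_ff = 512, 2048

attn = 4 * d * d          # Q, K, V, O projections: 1,048,576
ffn = 2 * d * d_ff        # up- and down-projection: 2,097,152
layernorm = 2 * 2 * d     # two LayerNorms, gain + bias each (assumed)

total = attn + ffn + layernorm
ffn_fraction = ffn / total    # about 0.666

print(f"attention={attn:,} ffn={ffn:,} total={total:,} ffn_fraction={ffn_fraction:.3f}")
```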

Exercise 4: Pen & Paper
Explain the residual-stream view. Why is Pre-LN cleaner than Post-LN?

Solution

The residual stream is a running sum that each sublayer reads from (via LayerNorm) and writes an increment back to: x ← x + Sublayer(LN(x)). Pre-LN normalizes the INPUT to each sublayer but leaves the residual stream itself un-normalized, so gradients flow through the identity path undistorted — stable even for very deep models. Post-LN (x ← LN(x + Sublayer(x))) normalizes the stream itself every layer, which can shrink/distort gradients and often requires careful warmup. Pre-LN's clean residual highway is why it dominates modern LLMs.

Exercise 5: Pen & Paper
Estimate total params for a decoder-only model (V=50257, d=768, 12 layers). Where do most live?

Solution

Embeddings: V×d = 50257×768 ≈ 38.6M (tied for input and output). Each layer ≈ 12d² ≈ 7.1M (attention 4d² + FFN 8d²), times 12 ≈ 85M. Total ≈ 124M once the small learned position-embedding, bias, and LayerNorm terms are included (this is GPT-2 small). Most parameters live in the 12 Transformer layers (≈85M), with the embedding/LM-head matrix the largest single block (≈39M).
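The estimate can be tallied as follows. This is a rough sketch: the context length `n_ctx = 1024` for the learned position embeddings is an assumption not given in the exercise (it matches GPT-2), and biases and LayerNorm parameters are left out.

```python
V, d, n_layers = 50257, 768, 12
n_ctx = 1024                      # assumed context length, not stated in the exercise

tok_emb = V * d                   # token embedding, shared with the LM head
pos_emb = n_ctx * d               # learned position embeddings
per_layer = 12 * d * d            # attention 4d^2 + FFN 8d^2
blocks = n_layers * per_layer

total = tok_emb + pos_emb + blocks
print(f"tok_emb={tok_emb:,} blocks={blocks:,} total={total:,}")
print(f"share in Transformer layers: {blocks / total:.2f}")
```

The layers account for roughly two thirds of the total here, which is consistent with the conclusion above.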

Exercise 6: Pen & Paper
Explain weight tying. Parameters saved for V=50257, d=768? Justification?

Solution

Weight tying shares the input embedding matrix (V×d) with the output projection (the LM head), saving one V×d matrix = 50257×768 ≈ 38.6M parameters. Conceptually justified because both map between token identity and the d-dimensional space — the same semantic geometry should govern reading a token in and scoring it out — and empirically tying improves perplexity while shrinking the model.

Exercise 7: Pen & Paper
Compare encoder-only, decoder-only, encoder-decoder. Why did decoder-only win for LLMs?

Solution

Encoder-only (BERT): bidirectional (no mask), masked-LM objective, good for understanding/classification. Decoder-only (GPT): causal mask, next-token objective, good for generation. Encoder-decoder (T5): bidirectional encoder + causal decoder with cross-attention, good for seq2seq. Decoder-only won for LLMs because next-token prediction is a simple, universal objective that scales, enables open-ended generation, supports in-context learning, and unifies all tasks as text continuation — one architecture for everything.

Exercise 8: Derive
Implement RoPE on paper: show rotating q,k by angle ∝ position makes q·k depend on relative position.

Solution

RoPE rotates each 2-D sub-vector of q at position m by angle mθ and of k at position n by angle nθ. Because rotation matrices satisfy R_aᵀR_b = R_{b−a}, the dot product of the two rotated vectors depends only on the difference of their angles: (R_{mθ}q)·(R_{nθ}k) = qᵀR_{mθ}ᵀR_{nθ}k = qᵀR_{(n−m)θ}k, a function of the offset n−m only. Summing over sub-vectors (each with its own frequency θ), the attention score between positions m and n depends only on their relative offset, giving relative-position awareness with no added parameters and potential for length extrapolation (Chapter 33).
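A quick numerical check of the relative-position property, as a sketch for a single 2-D sub-vector. The angle `theta`, the random vectors, and the helper names `rot` and `score` are illustrative assumptions; `rot` uses the standard counterclockwise rotation convention, under which the closed form is qᵀ R((n−m)θ) k.

```python
import numpy as np

def rot(angle):
    # Standard counterclockwise 2-D rotation matrix.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.1                                   # arbitrary base angle
rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)

def score(m, n):
    # Dot product of q rotated for position m and k rotated for position n.
    return (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# Two position pairs with the same offset n - m = 4 give the same score.
same_offset = np.isclose(score(3, 7), score(10, 14))

# The score matches the closed form that depends on the offset only.
closed_form = np.isclose(score(3, 7), q @ rot((7 - 3) * theta) @ k)

print(same_offset, closed_form)
```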

Exercise 9: Pen & Paper
Why does the KV-cache work only because of the causal mask?

Solution

With causal attention, token t attends only to tokens ≤ t, so the hidden state of each past token, and therefore its keys and values at every layer, is FIXED: adding a new token never changes them. This lets you cache and reuse them, computing only the new token's K/V each step. Without the causal mask (bidirectional attention), every earlier position would also attend to the new token, so its hidden state, and with it its keys and values in every layer after the first, would change; past K/V would not be reusable and the cache would be invalid.

Exercise 10: Pen & Paper
Model trains with low loss but generates poorly. Three architecture bugs and how to distinguish.

Solution

(1) Broken or absent causal mask: the model 'cheats' during training by reading future tokens, so teacher-forced loss is suspiciously low while generation is poor; detect it by inspecting the mask and by checking that changing a future token leaves earlier logits unchanged. (2) Train/inference position-encoding mismatch: verify positions are applied identically in both paths. (3) KV-cache bug: cached generation diverges from non-cached; detect it by comparing cached against full-recompute outputs, which must agree to numerical tolerance. To distinguish them, compare teacher-forced with free-running outputs, and unit-test the mask, positions, and cache separately.

Exercise 11: Code
Implement sinusoidal positional encoding; visualize as a heatmap; confirm frequency structure.

Solution

The heatmap (position × dimension) shows low dimensions oscillating rapidly with position and high dimensions varying slowly — a spectrum of frequencies. This multi-scale structure is what lets the model read both fine and coarse positional information, and underlies the relative-position property of Exercise 2.
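One possible implementation is sketched below, assuming the usual interleaved layout (sine in even dimensions, cosine in odd ones) and a base of 10000. The function name `sinusoidal_pe` and the sign-change count used as a crude frequency proxy are choices made here for illustration.

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model, base=10000.0):
    # Rows are positions, columns are encoding dimensions.
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / base ** (i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)     # even dimensions
    pe[:, 1::2] = np.cos(angles)     # odd dimensions
    return pe

pe = sinusoidal_pe(128, 64)

# Crude frequency proxy: how often each dimension changes sign along position.
sign_changes = (np.diff(np.sign(pe), axis=0) != 0).sum(axis=0)
print(sign_changes[0], sign_changes[62])   # low dimension vs high dimension
```

Passing `pe.T` to an image plot such as matplotlib's `imshow` gives the heatmap; the sign-change counts fall off from the low dimensions to the high ones, matching the frequency spectrum described above.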

Exercise 12: Code
Compare sinusoidal, learned, and rotary encodings on a small task; which extrapolates best?

Solution

Learned absolute encodings fail beyond the trained length (no embedding exists for unseen positions). Sinusoidal encodings are defined at every position, so they generalize somewhat, though quality typically degrades past the trained length. RoPE extrapolates best and, with the scaling tricks of Chapter 33, extends furthest, which illustrates why modern models adopt rotary embeddings.

Exercise 13: Code
Implement the complete Pre-LN Transformer block from scratch; verify output shape = input shape.

Solution

Wiring LN→MHA→residual→LN→FFN→residual and confirming that a (B,T,d) tensor exits with the same shape for any T validates the block, the reusable unit stacked to build the full model.
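The wiring can be sketched in NumPy (forward pass only). Several simplifications are assumptions made for brevity: LayerNorm has no learnable gain or bias, the FFN uses ReLU as a stand-in for GELU, there are no bias terms, and the parameter names in `params` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalizes the last axis; learnable gain/bias omitted for brevity.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_mha(x, Wqkv, Wo, n_heads):
    B, T, d = x.shape
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)               # each (B, T, d)
    def heads(t):
        return t.reshape(B, T, n_heads, d // n_heads).transpose(0, 2, 1, 3)
    q, k, v = heads(q), heads(k), heads(v)                 # (B, H, T, d_head)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d // n_heads)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)     # True above the diagonal
    scores = np.where(future, -np.inf, scores)             # block attention to the future
    out = softmax(scores) @ v                              # (B, H, T, d_head)
    return out.transpose(0, 2, 1, 3).reshape(B, T, d) @ Wo

def pre_ln_block(x, p, n_heads):
    # x <- x + MHA(LN(x)), then x <- x + FFN(LN(x))
    x = x + causal_mha(layer_norm(x), p["Wqkv"], p["Wo"], n_heads)
    h = np.maximum(0.0, layer_norm(x) @ p["W1"])           # ReLU stand-in for GELU
    return x + h @ p["W2"]

rng = np.random.default_rng(0)
d, H, d_ff = 32, 4, 128                                    # illustrative sizes
params = {
    "Wqkv": rng.normal(scale=0.02, size=(d, 3 * d)),
    "Wo": rng.normal(scale=0.02, size=(d, d)),
    "W1": rng.normal(scale=0.02, size=(d, d_ff)),
    "W2": rng.normal(scale=0.02, size=(d_ff, d)),
}

shapes_ok = all(
    pre_ln_block(rng.normal(size=(2, T, d)), params, H).shape == (2, T, d)
    for T in (1, 7, 16)
)
print(shapes_ok)
```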

Exercise 14: Code Lab
Build the complete GPT class; verify shapes end-to-end; confirm (T,V) logits.

Solution

Embedding + positional encoding + N Pre-LN blocks + final LayerNorm + LM head produces (T,V) logits for any token sequence. Verifying the shape flow end-to-end confirms the full decoder-only architecture is correctly assembled.

Exercise 15: Code
Implement weight tying; verify param count drops by V×d and forward still works.

Solution

Sharing the embedding matrix with the LM head (using its transpose for the output projection) reduces parameters by V×d ≈ 38.6M and the forward pass still yields valid logits — confirming the saving of Exercise 6 with no loss of function.
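A toy version of the check, as a sketch with a deliberately small vocabulary so it runs instantly. The dictionary layout, the names `wte` and `lm_head`, and the `count_params` helper are hypothetical; the helper counts arrays that share memory only once.

```python
import numpy as np

V, d = 1000, 64                                   # small illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(V, d))           # token embedding matrix

untied = {"wte": E, "lm_head": rng.normal(scale=0.02, size=(d, V))}
tied = {"wte": E, "lm_head": E.T}                 # E.T is a view, not a copy

def count_params(params):
    # Count each underlying buffer once, so a shared (tied) matrix is not double-counted.
    unique = []
    for w in params.values():
        if not any(np.shares_memory(w, u) for u in unique):
            unique.append(w)
    return sum(w.size for w in unique)

saved = count_params(untied) - count_params(tied)

# Forward pass with the tied head: embed tokens, then score against the same matrix.
tokens = np.array([3, 14, 159])
hidden = tied["wte"][tokens]                      # (T, d); stands in for the Transformer stack
logits = hidden @ tied["lm_head"]                 # (T, V)
print(saved == V * d, logits.shape)
```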

Exercise 16: Code
Implement greedy, temperature, top-k, nucleus sampling; confirm distributions differ on an untrained model.

Solution

On an untrained (near-uniform) model, greedy always picks the same token for a given context; temperature rescales the logits (T > 1 flattens the distribution, T < 1 sharpens it); top-k restricts sampling to the k most likely tokens; nucleus (top-p) keeps the smallest set of tokens whose probabilities sum to at least p. Sampling many continuations and inspecting the token distributions confirms each decoding strategy behaves as designed (Chapter 3).
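The four strategies can be sketched as functions that turn a logit vector into a choice or a sampling distribution. The function names and the random near-uniform logits standing in for an untrained model are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy(logits):
    # Deterministic: always the single most likely token.
    return int(np.argmax(logits))

def temperature_probs(logits, T=1.0):
    # T > 1 flattens the distribution, T < 1 sharpens it.
    return softmax(logits / T)

def top_k_probs(logits, k):
    # Keep the k largest logits (ties at the cutoff may keep a few more).
    cutoff = np.sort(logits)[-k]
    return softmax(np.where(logits >= cutoff, logits, -np.inf))

def nucleus_probs(logits, p):
    # Keep the smallest set of top tokens whose total probability is at least p.
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, p)) + 1]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

rng = np.random.default_rng(0)
logits = 0.1 * rng.normal(size=50)        # near-uniform, like an untrained model

dists = {
    "temperature=2.0": temperature_probs(logits, 2.0),
    "top_k=5": top_k_probs(logits, 5),
    "nucleus=0.5": nucleus_probs(logits, 0.5),
}
for name, dist in dists.items():
    sample = rng.choice(len(dist), p=dist)
    print(name, "support:", np.count_nonzero(dist), "sample:", sample)
print("greedy:", greedy(logits))
```

On near-uniform logits the support sizes differ sharply between strategies, which is the distinction the exercise asks you to confirm.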

Exercise 17: Code Lab
Implement the KV-cache for generation; measure speedup over recomputation for 100 tokens; confirm identical outputs.

Solution

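Without a cache, every step re-runs the whole prefix through the model; with a cache, each step processes only the new token and attends over stored K/V, so per-step attention cost drops from O(t²) to O(t) in the prefix length t. The speedup therefore grows with length, while the output stays token-for-token identical to naive recomputation, validating the optimization of Exercise 12.18 in the full model.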
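The equivalence between cached and recomputed attention can be sketched for one head in one layer. The sizes, random weights, and variable names are illustrative assumptions; timing is left out, and the comparison uses a floating-point tolerance.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 6, 8                                        # illustrative sizes
X = rng.normal(size=(T, d))                        # one input vector per step
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def full_causal(X):
    # Reference: recompute causal attention over the whole sequence at once.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ V

# Cached path: each step computes K/V for the new token only and reuses the rest.
K_cache, V_cache, outs = [], [], []
for t in range(T):
    x = X[t]
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    q = x @ Wq
    K, V = np.stack(K_cache), np.stack(V_cache)    # (t + 1, d)
    outs.append(softmax(q @ K.T / np.sqrt(d)) @ V)

cached = np.stack(outs)
match = np.allclose(cached, full_causal(X))
print(match)
```

The loop never needs a mask: at step t the cache holds only positions ≤ t, which is the causal structure that makes the cache valid.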

Exercise 18: Code
Convert the decoder to an encoder by removing the causal mask; verify full attention.

Solution

Dropping the mask lets every position attend to every other; inspecting the attention matrix shows it is now fully dense (not triangular). This single-line change turns a GPT-style decoder into a BERT-style encoder — illustrating how masking alone distinguishes the two.
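The effect of the mask on the attention matrix can be inspected directly. This sketch uses identity projections (scores computed from the inputs themselves) purely for brevity, which is an assumption; `attention_weights` is a name introduced here.

```python
import numpy as np

def attention_weights(X, causal):
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)          # identity Q/K projections, for brevity
    if causal:
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

X = np.random.default_rng(0).normal(size=(5, 8))
A_dec = attention_weights(X, causal=True)      # decoder: lower-triangular
A_enc = attention_weights(X, causal=False)     # encoder: fully dense

print(np.allclose(np.triu(A_dec, k=1), 0.0))   # no weight on future positions
print(bool((A_enc > 0).all()))                 # every position attends everywhere
```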

Exercise 19: Code
Load GPT-2 weights into a from-scratch model; verify logits match the reference to 1e-4.

Solution

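Carefully mapping Hugging Face's parameter names/shapes into your implementation and running a fixed input should reproduce the reference logits to ~1e−4. One detail to watch: the Hugging Face GPT-2 checkpoint stores its attention and FFN projections in Conv1D modules with weights laid out as (in, out), so they need transposing if your model uses (out, in) linear layers. Passing this gold-standard test shows your architecture is faithful in every detail that affects the output (layer order, LayerNorm placement, weight tying, scaling).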

Exercise 20: Code (Challenge)
Build a trainable nanoGPT: full forward+backward, train on Shakespeare, generate samples.

Solution

Implementing the complete model with backward (your autograd or PyTorch), training on a small corpus, and producing coherent (if small-scale) samples is the Part III capstone — a Transformer you built, trained, and sampled from end to end, embodying every concept from attention through the full architecture.