Solutions Appendix
Chapter 12

Attention Mechanisms

20 Solutions

Detailed solutions for the exercises in Chapter 12. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Explain attention as a soft dictionary lookup. What makes the soft version differentiable?

Solution

A hard dictionary returns the value whose key exactly matches the query — a discrete argmax, which has zero gradient almost everywhere. Attention replaces the hard match with a softmax over key–query similarities, returning a weighted average of all values. Because softmax is smooth, the output varies continuously with the queries and keys, so gradients flow and the lookup is learnable — a differentiable, content-based retrieval.
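The contrast can be sketched in a few lines of NumPy. This is a minimal illustration under assumed toy data; the names `hard_lookup` and `soft_lookup` and all numbers are hypothetical, not taken from the chapter.

```python
import numpy as np

def hard_lookup(query, keys, values):
    """Return the value whose key scores highest against the query (argmax)."""
    scores = keys @ query
    return values[np.argmax(scores)]

def soft_lookup(query, keys, values):
    """Return a softmax-weighted average of all values."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ values

# Toy data (illustrative values only).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([2.0, 0.1])

hard = hard_lookup(query, keys, values)
soft = soft_lookup(query, keys, values)

# Nudge the query slightly: the argmax winner does not change,
# but the soft result moves a little.
nudged = query + np.array([0.0, 0.01])
hard_nudged = hard_lookup(nudged, keys, values)
soft_nudged = soft_lookup(nudged, keys, values)

print("hard:", hard, "->", hard_nudged)
print("soft:", soft, "->", soft_nudged)
```

The hard result is piecewise constant in the query, so its gradient carries no signal; the soft result responds to every small change, which is what makes it trainable.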

Exercise 2 (Derive)
Show Var(q·k)=d for unit-variance components; justify √d scaling.

Solution

q·k = Σ_{i=1}^d q_i k_i. Assume the components of q and k are mutually independent with zero mean and unit variance. Each product then has mean E[q_i]E[k_i] = 0 and variance E[q_i²]E[k_i²] = 1, and the d products are independent, so their variances add: Var(q·k) = d. The dot product's standard deviation therefore grows like √d, pushing softmax inputs into saturation as d grows. Dividing by √d rescales the variance back to 1 and keeps the softmax in a well-conditioned range, which is the reason for scaled dot-product attention.
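A quick Monte Carlo check of the derivation, as a sketch only: it assumes standard-normal components (one distribution that satisfies the zero-mean, unit-variance condition), and the helper name `dot_product_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d, n_samples=10_000):
    """Estimate Var(q.k) and Var(q.k / sqrt(d)) for i.i.d. standard-normal q, k."""
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d)).var()

results = {d: dot_product_variance(d) for d in (16, 64, 256)}
for d, (raw, scaled) in results.items():
    print(f"d={d:4d}  Var(q.k)={raw:8.2f}  Var(q.k/sqrt(d))={scaled:.3f}")
```

The raw variance should track d closely, while the scaled variance should stay near 1 for every d.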

Exercise 3 (Pen & Paper)
Without √d scaling, what happens to softmax for large d?

Solution

The dot products have std ∝ √d, so for large d the logits become large in magnitude. Softmax of large, spread-out logits becomes nearly one-hot (almost all weight on the single largest score), making attention overly peaked. The gradient of a saturated softmax is tiny, so learning stalls. The √d division prevents this saturation, keeping attention distributions and gradients healthy.

Exercise 4 (Pen & Paper)
Why separate keys from values? Example where matching ≠ content representation.

Solution

Keys decide WHERE to attend; values decide WHAT is retrieved. Separating them lets a token be matched on one basis but contribute different information. Example: in coreference, the query from 'it' should match the key of its antecedent 'the dog' on grammatical/positional cues, but the value retrieved should be the semantic content of 'dog'. Forcing keys = values would couple 'how I'm found' with 'what I provide', losing flexibility.

Exercise 5 (Pen & Paper)
Write the causal mask for T=4; show −∞ scores get zero softmax weight.

Solution

The causal mask is lower-triangular: position i may attend to j only if j ≤ i. For T=4 the allowed (1s) form a lower triangle; the upper triangle (future) is set to −∞ before softmax. Since e^{−∞} = 0, those entries contribute exactly zero weight after normalization, so each position attends only to itself and the past — enforcing autoregressive causality.
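Written out for T=4, one way to express the additive mask and the resulting weights is the following. The symbols M (mask), s_{ij} (raw score) and α_{ij} (attention weight) are notation assumed here for illustration.

```latex
M =
\begin{pmatrix}
0 & -\infty & -\infty & -\infty \\
0 & 0       & -\infty & -\infty \\
0 & 0       & 0       & -\infty \\
0 & 0       & 0       & 0
\end{pmatrix},
\qquad
\alpha_{ij}
= \frac{e^{s_{ij} + M_{ij}}}{\sum_{j'=1}^{4} e^{s_{ij'} + M_{ij'}}}
=
\begin{cases}
\dfrac{e^{s_{ij}}}{\sum_{j' \le i} e^{s_{ij'}}} & j \le i, \\[2ex]
0 & j > i.
\end{cases}
```

Every row keeps at least its diagonal entry, so the denominator is never zero and each row still sums to 1.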

Exercise 6 (Pen & Paper)
Compare score-matrix shapes for self- vs cross-attention (target 10, source 15).

Solution

Self-attention on the 10-token target produces a 10×10 score matrix (each target token attends to all target tokens). Cross-attention has the 10 target queries attend to the 15 source keys, giving a 10×15 matrix. The difference: self-attention's keys/values come from the same sequence as the queries; cross-attention's come from a different (source) sequence, so the matrix is rectangular.

Exercise 7 (Derive)
Show H heads of dim d/H keep total compute ≈ single-head at dim d.

Solution

Each head operates on dimension d/H, so its QKᵀ and attention·V cost ∝ T²·(d/H). Summing over H heads gives H·T²·(d/H) = T²·d — the same as a single head at full dimension d. Multi-head attention thus buys multiple representation subspaces at no extra asymptotic cost; the projections add only O(T·d²), unchanged by H.
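The count can be checked numerically. This sketch tallies multiply-accumulates only and ignores softmax and memory traffic; the function names are hypothetical.

```python
def attention_core_macs(T, d, H):
    """Multiply-accumulates for QK^T and weights @ V, summed over H heads."""
    if d % H != 0:
        raise ValueError("d must be divisible by H")
    d_head = d // H
    per_head = 2 * T * T * d_head  # T*T*d_head for QK^T, same again for weights @ V
    return H * per_head

def projection_macs(T, d):
    """Multiply-accumulates for the Q, K, V and output projections (four d x d matrices)."""
    return 4 * T * d * d

T, d = 16, 512
single = attention_core_macs(T, d, H=1)
multi = attention_core_macs(T, d, H=8)
print(single, multi, projection_macs(T, d))
```

Both head counts give 2·T²·d for the attention core, and the projection term does not depend on H at all.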

Exercise 8 (Pen & Paper)
What does an induction head compute? Sketch its pattern for A B C A.

Solution

An induction head implements the rule 'if the current token X previously appeared followed by Y, predict Y next'. For ...A B C A, the head at the second A attends to B, the token that followed the first A, and copies it as the prediction. It can locate B because an earlier previous-token head has written 'preceded by A' into B's representation, which the induction head's query then matches. This prefix-matching/copying circuit is thought to underlie much of in-context learning: the model continues patterns it sees within the prompt.
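The rule itself, separated from any neural implementation, can be sketched as a plain function. This is only the input-output behaviour an induction head approximates, not how the circuit computes it; `induction_prediction` is a hypothetical name.

```python
def induction_prediction(tokens):
    """Copy the token that followed the most recent earlier occurrence
    of the current (last) token; return None if it never appeared before."""
    current = tokens[-1]
    # Scan earlier positions right to left, excluding the last token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_prediction(["A", "B", "C", "A"]))  # B
print(induction_prediction(["A", "B", "C"]))       # None
```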

Exercise 9 (Pen & Paper)
Attention is O(T²). Compute the FLOP ratio for T=2048 vs 32768.

Solution

Cost scales as T², so the ratio is (32768/2048)² = 16² = 256×. A 16× longer context costs 256× more attention compute (and memory for the score matrix). This quadratic blow-up is why long context is expensive and motivates the efficient-attention and state-space methods of Chapter 33.
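The arithmetic, plus the size of one score matrix, in a short sketch. The 2-bytes-per-entry figure assumes fp16 storage, which is an assumption made here rather than something the exercise specifies.

```python
def attention_cost_ratio(t_long, t_short):
    """Ratio of T^2 attention costs between two sequence lengths."""
    return (t_long / t_short) ** 2

ratio = attention_cost_ratio(32768, 2048)
print(ratio)  # 256.0

# Score-matrix memory per head, assuming 2 bytes per entry (fp16).
for T in (2048, 32768):
    print(T, T * T * 2 / 2**20, "MiB")
```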

Exercise 10 (Pen & Paper)
Encoder vs decoder: the one line that differs. Why does generation need the mask?

Solution

The difference is whether the causal mask is applied to the attention scores: the decoder adds the lower-triangular −∞ mask; the encoder does not (full bidirectional attention). Generation needs the mask because at training time we predict each token from only its predecessors; if a token could attend to the future, it would 'cheat' by seeing the answer, and the model would fail at inference where the future doesn't exist yet.

Exercise 11 (Code)
Implement scaled dot-product attention; verify rows sum to 1; compare entropy with/without √d.

Solution

The implementation computes softmax(QKᵀ/√d)V; each attention row sums to 1 (softmax). Measuring the entropy of the attention distribution shows the unscaled version (no √d) is far more peaked (lower entropy, near one-hot) for d=64, while the scaled version stays smoother — the empirical payoff of Exercises 2–3.
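One possible NumPy version of this experiment is sketched below. It assumes random Gaussian Q, K, V with T=32, and the helper names (`attention`, `mean_row_entropy`) are illustrative rather than the chapter's own.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, scale=True):
    """Return (output, weights); scale=False drops the 1/sqrt(d) factor."""
    d = Q.shape[-1]
    scores = Q @ K.T
    if scale:
        scores = scores / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

def mean_row_entropy(weights, eps=1e-12):
    """Average Shannon entropy (in nats) over the attention rows."""
    return float(-(weights * np.log(weights + eps)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, d = 32, 64
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

out, w_scaled = attention(Q, K, V, scale=True)
_, w_unscaled = attention(Q, K, V, scale=False)

print("rows sum to 1:", np.allclose(w_scaled.sum(axis=-1), 1.0))
print("entropy scaled:  ", mean_row_entropy(w_scaled))
print("entropy unscaled:", mean_row_entropy(w_unscaled))
print("uniform maximum: ", np.log(T))
```

The uniform distribution over T keys has entropy log T, which gives a reference point: the scaled entropy should sit much closer to it than the unscaled one.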

Exercise 12 (Code)
Implement self-attention as a class; show identical tokens get different outputs from context.

Solution

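Feed the same token in two different contexts, for example two sequences that share one word but differ elsewhere. The self-attention outputs at that token's position differ, because each output is an attention-weighted mix of the value vectors of every token in its sequence. (Within a single sequence, two copies of a token are told apart only if position enters the computation, via positional encodings or a causal mask.) This demonstrates contextualization, the property static embeddings (Chapter 8) lack and attention provides.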
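A minimal class-based sketch of the experiment follows. It assumes random, untrained projection matrices, a hypothetical four-word vocabulary with random static embeddings, and no positional encoding or mask.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SelfAttention:
    """Single-head self-attention with random (untrained) projections."""

    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.W_q = rng.standard_normal((d_model, d_model)) * scale
        self.W_k = rng.standard_normal((d_model, d_model)) * scale
        self.W_v = rng.standard_normal((d_model, d_model)) * scale
        self.d_model = d_model

    def __call__(self, X):
        Q, K, V = X @ self.W_q, X @ self.W_k, X @ self.W_v
        weights = softmax(Q @ K.T / np.sqrt(self.d_model))
        return weights @ V

rng = np.random.default_rng(1)
d_model = 16
# Hypothetical toy vocabulary: one fixed (static) embedding per word.
emb = {w: rng.standard_normal(d_model) for w in ["the", "river", "money", "bank"]}

attn = SelfAttention(d_model)
ctx_a = np.stack([emb["the"], emb["river"], emb["bank"]])
ctx_b = np.stack([emb["the"], emb["money"], emb["bank"]])

bank_a = attn(ctx_a)[2]  # output for "bank" next to "river"
bank_b = attn(ctx_b)[2]  # output for "bank" next to "money"
print("same output across contexts:", np.allclose(bank_a, bank_b))

# Same token twice in ONE sequence: with no positional information,
# both copies issue the same query over the same keys.
repeated = attn(np.stack([emb["bank"], emb["river"], emb["bank"]]))
print("same output within one sequence:", np.allclose(repeated[0], repeated[2]))
```

The second check is a reminder that this bare layer is blind to order: telling apart two copies of a token inside one sequence requires position information, which this sketch deliberately leaves out.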

Exercise 13 (Code)
Implement causal masking; visualize the attention matrix for 10 tokens; confirm lower-triangular.

Solution

The heatmap shows nonzero weights only on and below the diagonal — each token attends to itself and earlier tokens, with the upper triangle exactly zero. This visually confirms the mask correctly enforces causality.
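The numerical check behind the heatmap could look like this. It is a sketch on random scores; `causal_attention_weights` is a hypothetical helper, and the plotting lines are left commented out because they assume matplotlib is installed.

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask a (T, T) score matrix causally and return the softmax weights."""
    T = scores.shape[0]
    allowed = np.tril(np.ones((T, T), dtype=bool))  # True where j <= i
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) evaluates to exactly 0.0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 10
weights = causal_attention_weights(rng.standard_normal((T, T)))

print("upper triangle all zero:", bool(np.all(np.triu(weights, k=1) == 0.0)))
print("rows sum to 1:", np.allclose(weights.sum(axis=-1), 1.0))

# Optional heatmap (assumes matplotlib is available):
# import matplotlib.pyplot as plt
# plt.imshow(weights); plt.colorbar(); plt.xlabel("key j"); plt.ylabel("query i"); plt.show()
```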

Exercise 14 (Code Lab)
Implement multi-head attention with head-splitting; verify shapes for d=512,H=8,T=16.

Solution

Reshaping (T,512) into (T,8,64), transposing to (8,T,64), attending per head, and transposing and concatenating back to (T,512) reproduces the shape trace: with T=16, Q, K, V become (8,16,64), scores (8,16,16), output (16,512). Verifying each intermediate shape confirms the head-splitting reshape is correct, which is the most error-prone part of implementing MHA.
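The shape trace can be reproduced with a short NumPy sketch. The weights are random and untrained, and the function name and the `shapes` dictionary are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, H):
    """Multi-head self-attention on X of shape (T, d); also returns a shape trace."""
    T, d = X.shape
    d_head = d // H
    shapes = {}

    def split_heads(M):
        # (T, d) -> (T, H, d_head) -> (H, T, d_head)
        return M.reshape(T, H, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    shapes["Q"] = Q.shape
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (H, T, T)
    shapes["scores"] = scores.shape
    heads = softmax(scores) @ V                           # (H, T, d_head)
    shapes["heads"] = heads.shape
    concat = heads.transpose(1, 0, 2).reshape(T, d)       # back to (T, d)
    out = concat @ W_o
    shapes["out"] = out.shape
    return out, shapes

rng = np.random.default_rng(0)
T, d, H = 16, 512, 8
X = rng.standard_normal((T, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

out, shapes = multi_head_attention(X, W_q, W_k, W_v, W_o, H)
for name, shape in shapes.items():
    print(f"{name:7s}{shape}")
```

Reshaping (T, d) directly to (H, T, d_head) without the transpose would also produce a valid-looking shape, but it would scramble tokens across heads, which is why checking shapes alone is necessary but not sufficient.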

Exercise 15 (Code)
Implement cross-attention; visualize a (5,7) alignment matrix for a toy translation.

Solution

With a 5-token target attending to a 7-token source, the (5,7) attention matrix shows which source tokens each target token draws from; for an aligned task it is roughly diagonal, with off-diagonal mass where reordering occurs — the alignment learned by cross-attention.
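A minimal cross-attention sketch producing the (5,7) matrix is given below. The projections are random and untrained, so the weights here will not show a meaningful alignment; a diagonal pattern only emerges after training on an aligned task. All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(target, source, W_q, W_k, W_v):
    """Queries come from the target; keys and values come from the source."""
    Q = target @ W_q
    K = source @ W_k
    V = source @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (T_target, T_source)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 16
target = rng.standard_normal((5, d))  # 5 target tokens
source = rng.standard_normal((7, d))  # 7 source tokens
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, align = cross_attention(target, source, W_q, W_k, W_v)
print(out.shape, align.shape)  # (5, 16) (5, 7)
```

Each row of `align` is a distribution over the 7 source tokens, so passing it to a heatmap routine gives the alignment picture described above.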

Exercise 16 (Code)
Compare your attention to F.scaled_dot_product_attention; verify to 1e-5.

Solution

Running identical inputs through your implementation and PyTorch's fused kernel yields outputs matching to ~1e−5, confirming correctness (and that the fused version is purely an efficiency optimization, not a different computation).

Exercise 17 (Code Lab)
Load distilgpt2; visualize attention heads; identify a previous-token head and a delimiter head.

Solution

Extracting and plotting head attention maps reveals specialized patterns: a 'previous-token' head with weight concentrated just below the diagonal (attending to position i−1), and a 'delimiter' head that attends to punctuation/separator tokens. This shows real models learn interpretable, specialized attention circuits.

Exercise 18 (Code)
Implement the KV-cache so each new token costs O(T) not O(T²); measure speedup.

Solution

Caching past keys and values means generating token t only computes attention between the new query and the cached T keys/values — O(T) per step instead of recomputing the full O(T²) attention. Measured over 100+ tokens this gives a large speedup that grows with sequence length, while producing identical outputs.
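The equivalence claim can be checked with a single-head sketch. The `KVCache` class is a hypothetical illustration that grows its arrays with `np.vstack` for brevity; a real implementation would more likely preallocate a buffer. Timing is left out so the example stays deterministic.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Append-only store of past keys and values for one attention head."""

    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, q, k, v):
        """Append the new key/value, then attend from the new query only."""
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        weights = softmax(self.K @ q / np.sqrt(q.shape[-1]))  # one row of scores
        return weights @ self.V

def full_causal_attention(Q, K, V):
    """Recompute causal attention for all T positions at once."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 12, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

cache = KVCache(d)
cached_out = np.stack([cache.step(Q[t], K[t], V[t]) for t in range(T)])
print("matches full recompute:", np.allclose(cached_out, full_causal_attention(Q, K, V)))
```

The cached path never needs a mask: at step t the cache simply does not contain any future keys yet.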

Exercise 19 (Code)
Measure attention's quadratic scaling: time forward passes for T=128..2048; confirm O(T²).

Solution

Plotting forward-pass time against T on log-log axes yields a slope of ~2 once T is large enough for the T² terms to dominate fixed overheads (each doubling of T roughly quadruples the time), empirically confirming the O(T²) cost and motivating long-context techniques.
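A CPU-only sketch of the measurement, using NumPy and a least-squares fit instead of a plot. The helper names are hypothetical, d=64 is an assumed head size, and the measured slope depends on hardware and library overheads, so it may fall short of 2 at the smaller lengths.

```python
import time
import numpy as np

def attention_forward(Q, K, V):
    """Unmasked single-head scaled dot-product attention."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def time_attention(T, d=64, repeats=3):
    """Best-of-`repeats` wall-clock time for one forward pass at length T."""
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        attention_forward(Q, K, V)
        best = min(best, time.perf_counter() - start)
    return best

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(xs), np.log(ys), 1)
    return float(slope)

lengths = [128, 256, 512, 1024, 2048]
times = [time_attention(T) for T in lengths]
print("measured log-log slope:", loglog_slope(lengths, times))
```

Taking the best of several repeats reduces noise from other processes; fitting only the largest lengths is a reasonable variation if the small-T points pull the slope down.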

Exercise 20 (Code, Challenge)
Implement MHA with backward via your autograd; verify vs PyTorch; assemble into a Pre-LN block.

Solution

Building multi-head attention on the Chapter-11 autograd, checking gradients against PyTorch (~1e−5), and wiring it with LayerNorm, residuals, and an FFN into the Pre-LN block yields a complete, differentiable Transformer layer built from scratch — the culmination of the attention chapter.