Attention Mechanisms
Detailed solutions for the exercises in Chapter 12. Try solving them yourself before checking the answers.
Solution
A hard dictionary returns the value whose key exactly matches the query — a discrete argmax, which has zero gradient almost everywhere. Attention replaces the hard match with a softmax over key–query similarities, returning a weighted average of all values. Because softmax is smooth, the output varies continuously with the queries and keys, so gradients flow and the lookup is learnable — a differentiable, content-based retrieval.
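The contrast between the two lookups can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's reference code: the function names `hard_lookup` and `soft_lookup` and the toy keys, values, and query are assumptions chosen for the example.

```python
import numpy as np

def hard_lookup(query, keys, values):
    """Return the value of the best-matching key (argmax, so no useful gradient)."""
    scores = keys @ query
    return values[np.argmax(scores)]

def soft_lookup(query, keys, values):
    """Return a softmax-weighted average of all values (smooth in query and keys)."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ values, weights

# Toy data (hypothetical): three 2-d keys, each storing a 1-d value.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([2.0, 0.5])

hard = hard_lookup(query, keys, values)
soft, weights = soft_lookup(query, keys, values)
print("hard:", hard, "soft:", soft, "weights:", weights)
```

Nudging `query` slightly changes `soft` slightly, while `hard` stays constant until the argmax flips, which is the behavior the solution describes.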
Solution
q·k = Σ_{i=1}^d q_i k_i. Assume the components of q and k are mutually independent with zero mean and unit variance. Each product q_i k_i then has mean 0 and variance E[q_i²]·E[k_i²] = 1, and the d products are independent, so Var(q·k) = d. The dot product's std therefore grows like √d, pushing softmax inputs into saturation as d grows. Dividing by √d rescales the variance back to 1, keeping the softmax in a well-conditioned range. This is the reason for scaled dot-product attention.
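The variance claim is easy to check empirically. The sketch below assumes standard normal components (one distribution that satisfies the zero-mean, unit-variance assumption); the helper name `dot_product_variance` and the sample size are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_variance(d, n_samples=10_000):
    """Estimate Var(q.k) and Var(q.k / sqrt(d)) from random q, k pairs."""
    q = rng.standard_normal((n_samples, d))
    k = rng.standard_normal((n_samples, d))
    dots = np.sum(q * k, axis=1)
    return dots.var(), (dots / np.sqrt(d)).var()

for d in (16, 64, 256):
    raw, scaled = dot_product_variance(d)
    print(f"d={d:4d}  Var(q.k)={raw:8.2f}  Var(q.k/sqrt(d))={scaled:.3f}")
```

The raw variance should track d, while the scaled variance should stay near 1 for every d.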
Solution
The dot products have std ∝ √d, so for large d the logits become large in magnitude. Softmax of large, spread-out logits becomes nearly one-hot (almost all weight on the single largest score), making attention overly peaked. The gradient of a saturated softmax is tiny, so learning stalls. The √d division prevents this saturation, keeping attention distributions and gradients healthy.
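The vanishing gradient can be seen directly in the softmax Jacobian, diag(p) − p pᵀ. The following sketch uses a fixed, evenly spaced logit vector (an assumption made so the result is deterministic) and multiplies it by growing factors to mimic larger d.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian of softmax at z: diag(p) - p p^T."""
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

base = np.linspace(-1.0, 1.0, 8)  # hypothetical logits of moderate size
for scale in (1.0, 8.0, 32.0):
    p = softmax(scale * base)
    J = softmax_jacobian(scale * base)
    print(f"scale={scale:5.1f}  max weight={p.max():.4f}  ||J||={np.linalg.norm(J):.2e}")
```

As the scale grows, the largest weight approaches 1 and the Jacobian norm collapses toward zero, so almost no gradient reaches the scores.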
Solution
Keys decide WHERE to attend; values decide WHAT is retrieved. Separating them lets a token be matched on one basis but contribute different information. Example: in coreference, the key of 'it' should match on grammatical/positional cues to find its antecedent 'the dog', but the value retrieved should be the semantic content of 'dog'. Forcing keys = values would couple 'how I'm found' with 'what I provide', losing flexibility.
Solution
The causal mask is lower-triangular: position i may attend to j only if j ≤ i. For T=4 the allowed entries (1s) form the lower triangle including the diagonal; the upper triangle (future positions) is set to −∞ before the softmax. Since e^{−∞} = 0, those entries receive exactly zero weight after normalization, so each position attends only to itself and the past, which enforces autoregressive causality.
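For T=4 the mask and its effect on the softmax can be written out explicitly. This is a small sketch with random scores standing in for QKᵀ/√d; the variable names are illustrative.

```python
import numpy as np

T = 4
allowed = np.tril(np.ones((T, T), dtype=bool))  # True where j <= i
print(allowed.astype(int))

rng = np.random.default_rng(0)
scores = rng.standard_normal((T, T))            # stand-in for Q K^T / sqrt(d)
masked = np.where(allowed, scores, -np.inf)     # future positions get -inf

# Row-wise softmax; exp(-inf) evaluates to exactly 0.
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(3))
```

The first row puts all of its weight on position 0, and every entry above the diagonal is exactly zero.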
Solution
Self-attention on the 10-token target produces a 10×10 score matrix (each target token attends to all target tokens). Cross-attention has the 10 target queries attend to the 15 source keys, giving a 10×15 matrix. The difference: self-attention's keys/values come from the same sequence as the queries; cross-attention's come from a different (source) sequence, so the matrix is rectangular.
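The two shapes can be confirmed with a short sketch. It omits the learned Q/K projections and uses a hypothetical model width of 32, since only the sequence lengths matter for the shape of the weight matrix.

```python
import numpy as np

def attention_weights(Q, K):
    """Row-wise softmax of Q K^T / sqrt(d); shape is (len(Q), len(K))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 32                                   # assumed width, not from the exercise
target = rng.standard_normal((10, d))    # 10 target tokens
source = rng.standard_normal((15, d))    # 15 source tokens

self_w = attention_weights(target, target)    # queries and keys from the target
cross_w = attention_weights(target, source)   # queries from target, keys from source
print(self_w.shape, cross_w.shape)
```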
Solution
Each head operates on dimension d/H, so its QKᵀ and attention·V cost ∝ T²·(d/H). Summing over H heads gives H·T²·(d/H) = T²·d — the same as a single head at full dimension d. Multi-head attention thus buys multiple representation subspaces at no extra asymptotic cost; the projections add only O(T·d²), unchanged by H.
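The cancellation of H can be checked with a simple operation count. The sketch below counts multiply-adds only and ignores the softmax itself; the function names and the example sizes are assumptions for illustration.

```python
def attention_core_flops(T, d, H):
    """Approximate multiply-adds for Q K^T plus weights @ V, summed over H heads."""
    d_head = d // H
    per_head = 2 * T * T * d_head  # T*T*d_head for the scores, the same again for weights @ V
    return H * per_head

def projection_flops(T, d):
    """Multiply-adds for the Q, K, V and output projections: four (T,d) x (d,d) products."""
    return 4 * T * d * d

T, d = 1024, 512
for H in (1, 4, 8, 16):
    print(H, attention_core_flops(T, d, H), projection_flops(T, d))
```

Every row prints the same two numbers: the head count changes how the work is partitioned, not how much of it there is.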
Solution
An induction head implements the rule 'if the current token X previously appeared followed by Y, predict Y next'. For ...A B C A, the head at the second A finds the earlier occurrence of A and attends to the token that followed it (B), copying B as the prediction. This prefix-matching and copying circuit is thought to underlie much of in-context learning: the model generalizes patterns it sees within the prompt.
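The rule itself (not the attention circuit that implements it in a trained model) can be written as a plain function. The name `induction_predict` and the choice to use the most recent earlier occurrence are assumptions of this sketch.

```python
def induction_predict(tokens):
    """Apply the induction rule to the last token; return None if it has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that followed it
    return None

print(induction_predict(["A", "B", "C", "A"]))
print(induction_predict(["A", "B", "C"]))
```

The first call returns "B", matching the example in the solution; the second returns None because "C" has no earlier occurrence.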
Solution
Cost scales as T², so the ratio is (32768/2048)² = 16² = 256×. A 16× longer context costs 256× more attention compute (and memory for the score matrix). This quadratic blow-up is why long context is expensive and motivates the efficient-attention and state-space methods of Chapter 33.
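The arithmetic, plus a rough memory figure, can be reproduced as follows. The memory estimate assumes the full T×T score matrix is materialized in 2-byte (fp16) entries for a single head; both are assumptions of this sketch, not statements from the exercise.

```python
def attention_cost_ratio(t_long, t_short):
    """Ratio of quadratic attention cost between two context lengths."""
    return (t_long / t_short) ** 2

def score_matrix_bytes(T, bytes_per_entry=2):
    """Bytes for one materialized T x T score matrix (assumed fp16 by default)."""
    return T * T * bytes_per_entry

print(attention_cost_ratio(32768, 2048))        # 256.0
print(score_matrix_bytes(2048) / 2**20, "MiB")  # 8.0 MiB
print(score_matrix_bytes(32768) / 2**30, "GiB") # 2.0 GiB
```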
Solution
The difference is whether the causal mask is applied to the attention scores: the decoder adds the lower-triangular −∞ mask; the encoder does not (full bidirectional attention). Generation needs the mask because at training time we predict each token from only its predecessors; if a token could attend to the future, it would 'cheat' by seeing the answer, and the model would fail at inference where the future doesn't exist yet.
Solution
The implementation computes softmax(QKᵀ/√d)V; each attention row sums to 1 (softmax). Measuring the entropy of the attention distribution shows the unscaled version (no √d) is far more peaked (lower entropy, near one-hot) for d=64, while the scaled version stays smoother — the empirical payoff of Exercises 2–3.
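A minimal version of this experiment might look like the sketch below. It assumes random standard normal Q, K, V with T=32 and d=64, and measures the mean Shannon entropy (in nats) of the attention rows; the helper names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(weights):
    """Mean row entropy in nats, treating 0 * log(0) as 0."""
    safe = np.where(weights > 0, weights, 1.0)
    return float(-(weights * np.log(safe)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
T, d = 32, 64
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

scores = Q @ K.T
scaled_w = softmax(scores / np.sqrt(d))   # scaled dot-product attention
unscaled_w = softmax(scores)              # same scores without the sqrt(d) division
out = scaled_w @ V

print("entropy scaled:  ", mean_entropy(scaled_w))
print("entropy unscaled:", mean_entropy(unscaled_w))
print("uniform bound:   ", np.log(T))
```

The scaled entropy should sit fairly close to the uniform bound log T, while the unscaled entropy should be much lower.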
Solution
Feeding a sequence where the same token appears in two different contexts, the self-attention outputs for those positions differ, because each token's representation is a context-weighted mix of its neighbors. This demonstrates contextualization — the property static embeddings (Chapter 8) lack and attention provides.
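One way to see this is to place the same word in two short sequences and compare the outputs at its position. The sketch assumes a tiny made-up vocabulary, random static embeddings, and identity Q/K/V projections with no positional encoding, all of which are simplifications for illustration.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity projections (a simplification)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
vocab = {"the": 0, "river": 1, "money": 2, "bank": 3}   # hypothetical vocabulary
E = rng.standard_normal((len(vocab), 16))               # static embedding table

def encode(words):
    return E[[vocab[w] for w in words]]

in_a, in_b = encode(["the", "river", "bank"]), encode(["the", "money", "bank"])
out_a, out_b = self_attention(in_a), self_attention(in_b)

print("static 'bank' identical:  ", np.array_equal(in_a[2], in_b[2]))
print("attended 'bank' identical:", np.allclose(out_a[2], out_b[2]))
```

The static embedding of "bank" is the same row of `E` in both sequences, but its attention output differs because the other tokens it mixes in differ.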
Solution
The heatmap shows nonzero weights only on and below the diagonal — each token attends to itself and earlier tokens, with the upper triangle exactly zero. This visually confirms the mask correctly enforces causality.
Solution
With T=16, d=512, and 8 heads of size 64, the shape trace is reproduced as follows: reshaping (16,512) into (16,8,64) and moving the head axis to the front gives Q, K, V of shape (8,16,64); the per-head scores are (8,16,16); the per-head outputs (8,16,64) are transposed back and concatenated into (16,512). Verifying each intermediate shape confirms the head-splitting reshape is correct, which is the most error-prone part of implementing MHA.
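The shape trace can be checked step by step in NumPy. This is a sketch under assumed random weights (`Wq`, `Wk`, `Wv`, `Wo` are hypothetical names), with no mask and no batch dimension.

```python
import numpy as np

T, d_model, H = 16, 512, 8
d_head = d_model // H

rng = np.random.default_rng(0)
x = rng.standard_normal((T, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(4))

def split_heads(t):
    # (T, d_model) -> (T, H, d_head) -> (H, T, d_head)
    return t.reshape(T, H, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (H, T, T)
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
heads = weights @ V                                      # (H, T, d_head)
out = heads.transpose(1, 0, 2).reshape(T, d_model) @ Wo  # concatenate heads, then project

for name, arr in [("Q", Q), ("scores", scores), ("heads", heads), ("out", out)]:
    print(f"{name:7s}", arr.shape)
```

A useful sanity check is that head h of `Q` equals columns h·64 to (h+1)·64 of `x @ Wq`; reshaping to (H, T, d_head) directly, without the transpose, would silently mix tokens and heads.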
Solution
With a 5-token target attending to a 7-token source, the (5,7) attention matrix shows which source tokens each target token draws from; for an aligned task it is roughly diagonal, with off-diagonal mass where reordering occurs — the alignment learned by cross-attention.
Solution
Running identical inputs through your implementation and PyTorch's fused kernel yields outputs matching to ~1e−5, confirming correctness (and that the fused version is purely an efficiency optimization, not a different computation).
Solution
Extracting and plotting head attention maps reveals specialized patterns: a 'previous-token' head with weight concentrated just below the diagonal (attending to position i−1), and a 'delimiter' head that attends to punctuation/separator tokens. This shows real models learn interpretable, specialized attention circuits.
Solution
Caching past keys and values means that generating token t only computes attention between the one new query and the t cached keys/values: O(t) per step instead of recomputing the full O(t²) attention over the whole prefix. Measured over 100+ tokens this gives a large speedup that grows with sequence length, while producing identical outputs.
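The equivalence between cached, step-by-step attention and full causal attention can be verified with a small sketch. It assumes a single head, precomputed Q, K, V rows, and a simple list-based cache (the `cache` dictionary and function names are illustrative; real implementations typically preallocate the cache).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_full(Q, K, V):
    """Recompute masked attention over the whole sequence at once."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

def attention_step(q, k, v, cache):
    """Append the new key/value to the cache and attend with the single new query."""
    cache["K"].append(k)
    cache["V"].append(v)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    w = softmax(K @ q / np.sqrt(q.shape[-1]))  # one score per cached position
    return w @ V

rng = np.random.default_rng(0)
T, d = 12, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

cache = {"K": [], "V": []}
stepwise = np.stack([attention_step(Q[t], K[t], V[t], cache) for t in range(T)])
full = causal_attention_full(Q, K, V)
print("max abs difference:", np.abs(stepwise - full).max())
```

No mask is needed in `attention_step` because the cache only ever contains past and current positions.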
Solution
Plotting forward-pass time against T on log-log axes yields a slope of ~2 (each doubling of T roughly quadruples the time), empirically confirming the O(T²) cost and motivating long-context techniques.
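The slope can be estimated with a least-squares line through the log-log points. The sketch below uses synthetic timings (a quadratic term plus a small linear overhead, with made-up constants) in place of real measurements; measured times would be substituted for `synthetic_times`.

```python
import numpy as np

def loglog_slope(Ts, times):
    """Slope of the best-fit line through (log T, log time)."""
    slope, _intercept = np.polyfit(np.log(Ts), np.log(times), deg=1)
    return slope

Ts = np.array([256, 512, 1024, 2048, 4096], dtype=float)

# Hypothetical timings: the constants are placeholders, not measurements.
synthetic_times = 3e-9 * Ts**2 + 1e-7 * Ts

print("slope, pure quadratic:   ", loglog_slope(Ts, Ts**2))
print("slope, with linear term: ", loglog_slope(Ts, synthetic_times))
```

The linear overhead pulls the fitted slope slightly below 2, which is one reason real measurements tend to approach 2 only at larger T.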
Solution
Building multi-head attention on the Chapter-11 autograd, checking gradients against PyTorch (~1e−5), and wiring it with LayerNorm, residuals, and an FFN into the Pre-LN block yields a complete, differentiable Transformer layer built from scratch — the culmination of the attention chapter.
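The wiring of the Pre-LN block can be sketched as a forward pass in NumPy. This is not the Chapter-11 autograd version and has no gradient check: it assumes a ReLU FFN, LayerNorm without learned gain or bias, no mask, and hypothetical parameter names in the dictionary `p`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mha(x, p, H):
    T, d = x.shape
    def split(t):  # (T, d) -> (H, T, d/H)
        return t.reshape(T, H, d // H).transpose(1, 0, 2)
    Q, K, V = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    w = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d // H))
    return (w @ V).transpose(1, 0, 2).reshape(T, d) @ p["Wo"]

def ffn(x, p):
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]  # ReLU is an assumption

def pre_ln_block(x, p, H):
    x = x + mha(layer_norm(x), p, H)  # normalize, attend, add residual
    x = x + ffn(layer_norm(x), p)     # normalize, feed forward, add residual
    return x

rng = np.random.default_rng(0)
T, d, H, d_ff = 6, 32, 4, 64
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, d_ff), "W2": (d_ff, d)}
p = {name: rng.standard_normal(s) / np.sqrt(s[0]) for name, s in shapes.items()}
x = rng.standard_normal((T, d))
y = pre_ln_block(x, p, H)
print(y.shape)
```

Because normalization sits inside each residual branch, zeroing `Wo` and `W2` turns the whole block into the identity, which makes a convenient test of the residual wiring.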