Architecture Variants
Detailed solutions for the exercises in Chapter 19. Try solving them yourself before checking the answers.
Solution
Encoder-only (BERT): bidirectional, masked-LM objective, best for classification/understanding. Decoder-only (GPT): causal mask, next-token objective, best for generation. Encoder-decoder (T5): bidirectional encoder + causal decoder with cross-attention, best for seq2seq. Decoder-only won for generative LLMs because next-token prediction is a single universal objective that scales, naturally supports open-ended generation and in-context learning, and recasts every task as text continuation — maximal simplicity and generality.
Solution
RoPE rotates the query at position m by R(mθ) and the key at position n by R(nθ). Their dot product is (R(mθ)q)ᵀ(R(nθ)k) = qᵀR(mθ)ᵀR(nθ)k = qᵀR((n−m)θ)k, since rotation matrices satisfy R(a)ᵀR(b) = R(b−a). The score therefore depends only on the relative offset n−m, not the absolute positions — RoPE encodes relative position through rotation, with no learned position parameters.
Solution
Learned absolute embeddings have one trained vector per position index, so positions beyond the training length simply have no embedding (and the model never learned them) — it breaks. RoPE applies a deterministic rotation defined for any position, and ALiBi adds a deterministic distance penalty for any gap, so both are defined at unseen positions and degrade gracefully — enabling length extrapolation (the springboard for Chapter 33).
Solution
ALiBi adds a bias to each attention score that decreases linearly with the distance between query and key (with a per-head slope), so far-apart tokens are penalized more. Because this bias is a simple function of distance — not a learned per-position vector — it is defined for ANY distance, including gaps larger than seen in training. The model's locality preference learned at short lengths transfers to longer ones, giving extrapolation.
Solution
RoPE replaces learned absolute positions (fixes length extrapolation). RMSNorm replaces LayerNorm (cheaper, no mean subtraction). SwiGLU replaces the GELU FFN (better quality per parameter). Pre-norm replaces post-norm (training stability). No bias terms replaces biased linears (negligible cost, slight stability/throughput gain). Grouped-query attention replaces full MHA in later versions (smaller KV cache). Each swap targets a concrete efficiency or stability problem of the original Transformer.
Solution
Cache = 2 (K and V) × layers × heads × head_dim × seq_len × batch × bytes.
The KV cache (≈215 GB here) dwarfs the model weights at long context and large batch — which is exactly why GQA/MQA (Exercise 8) and KV-cache compression matter so much for serving.
Solution
Training processes many tokens in parallel with large matmuls — lots of compute per byte moved (high arithmetic intensity), so it is compute-bound. Autoregressive generation produces ONE token at a time, so each step does little compute but must read all the weights and the entire KV cache from memory — low arithmetic intensity, memory-bandwidth-bound. A large KV cache means more bytes to read per generated token, directly throttling generation throughput.
Solution
MHA has a separate K/V per query head (largest cache, best quality). MQA shares ONE K/V across all query heads (smallest cache, some quality loss). GQA groups query heads to share K/V per group — a middle ground. With 64 query heads and 8 groups, GQA stores K/V for 8 heads instead of 64, an 8× cache reduction, while retaining most of MHA's quality — why GQA is the modern default.
Solution
FlashAttention computes identical attention but never materializes the full T×T score matrix in slow HBM. It tiles the computation, keeping blocks in fast on-chip SRAM and using the online-softmax trick to combine blocks. This slashes HBM memory TRAFFIC (the actual bottleneck, since attention is memory-bound), so it runs much faster and uses far less memory despite the same arithmetic — an IO-aware algorithm.
Solution
Each layer lets a token attend w positions back, so after L stacked layers information can propagate up to ≈ w×L positions (each layer extends reach by w, like a CNN's growing receptive field). So even though any single layer is local, the stack achieves a long effective receptive field, letting distant tokens influence each other indirectly through intermediate positions — at far lower cost than full attention.
Solution
Precompute cos/sin of position×frequency, rotate Q and K accordingly, and confirm that scores between positions m and n depend only on m−n (Exercise 2) — e.g. shifting both positions by a constant leaves the score unchanged. This validates the relative-position property in code.
Solution
Adding the per-head linear distance bias to scores produces attention that decays with distance (local focus), whereas no positional encoding gives position-agnostic attention. Comparing the distributions shows ALiBi's built-in recency/locality bias — the mechanism behind its extrapolation (Exercise 4).
Solution
RMSNorm divides by the root-mean-square (no mean subtraction); SwiGLU gates one projection by the SiLU of another. Setting the SwiGLU hidden width to (8/3)d makes its three matrices total the same parameter count as the two-matrix 4d GELU FFN — confirming the parameter-match of Chapter 10's Exercise 8 in a built block.
Solution
Implementing grouped K/V projections and measuring cache size confirms MHA (full), GQA-8 (8× smaller for 64 heads), and MQA (heads× smaller) follow the expected ratios of Exercise 8 — a direct demonstration of the memory/quality dial.
Solution
Caching K/V during generation shows memory growing linearly with sequence length (Exercise 6). Switching the model to GQA reduces the cache by the group factor, visibly lowering the memory-vs-length curve — the practical payoff for long-context serving.
Solution
The fused FlashAttention kernel produces identical outputs (to ~1e−5) but at T=4096 runs faster and uses dramatically less peak memory than the naive implementation that materializes the score matrix — confirming Exercise 9's IO-traffic argument empirically.
Solution
Masking attention to a local window and stacking layers on a synthetic propagation task shows information reaching tokens up to ≈ w×L away (Exercise 10) — demonstrating that local attention plus depth still captures long-range dependencies.
Solution
A top-1 router sends each token to a single expert, so only one expert's parameters are active per token while all four exist in memory. Measuring active (≈¼ of total FFN) vs total parameters previews the MoE capacity/compute decoupling explored fully in Chapter 32.
Solution
Wiring the modern components into one block and verifying shape flow, then comparing parameter counts, shows the modern block achieves similar or better capacity with efficiency wins (smaller KV cache via GQA, cheaper norm) — the consolidated 'modern recipe' of Exercise 5.
Solution
Swapping in the modern components and training alongside the baseline on the same data typically yields comparable or better perplexity with lower memory (RMSNorm, GQA cache) and faster generation — a hands-on demonstration of why the field migrated from the original Transformer to the LLaMA-style recipe.