Solutions Appendix
Chapter 19

Architecture Variants

20 Solutions

Detailed solutions for the exercises in Chapter 19. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Compare encoder-only, decoder-only, encoder-decoder; why did decoder-only win for generative LLMs?

Solution

Encoder-only (BERT): bidirectional, masked-LM objective, best for classification/understanding. Decoder-only (GPT): causal mask, next-token objective, best for generation. Encoder-decoder (T5): bidirectional encoder + causal decoder with cross-attention, best for seq2seq. Decoder-only won for generative LLMs because next-token prediction is a single universal objective that scales, naturally supports open-ended generation and in-context learning, and recasts every task as text continuation — maximal simplicity and generality.

Exercise 2Derive
Show RoPE makes q·k depend only on (m−n), using R(mθ)ᵀR(nθ)=R((n−m)θ).

Solution

RoPE rotates the query at position m by R(mθ) and the key at position n by R(nθ). Their dot product is (R(mθ)q)ᵀ(R(nθ)k) = qᵀR(mθ)ᵀR(nθ)k = qᵀR((n−m)θ)k, since rotation matrices satisfy R(a)ᵀR(b) = R(b−a). The score therefore depends only on the relative offset n−m, not the absolute positions — RoPE encodes relative position through rotation, with no learned position parameters.

Exercise 3Pen & Paper
Why can't learned absolute positions handle longer-than-training sequences, while RoPE/ALiBi can?

Solution

Learned absolute embeddings have one trained vector per position index, so positions beyond the training length simply have no embedding (and the model never learned them) — it breaks. RoPE applies a deterministic rotation defined for any position, and ALiBi adds a deterministic distance penalty for any gap, so both are defined at unseen positions and degrade gracefully — enabling length extrapolation (the springboard for Chapter 33).

Exercise 4Pen & Paper
Describe ALiBi's linear bias; how does penalizing distant tokens enable length extrapolation?

Solution

ALiBi adds a bias to each attention score that decreases linearly with the distance between query and key (with a per-head slope), so far-apart tokens are penalized more. Because this bias is a simple function of distance — not a learned per-position vector — it is defined for ANY distance, including gaps larger than seen in training. The model's locality preference learned at short lengths transfers to longer ones, giving extrapolation.

Exercise 5Pen & Paper
List the 'LLaMA recipe' components; for each, the old default it replaced and the problem addressed.

Solution

RoPE replaces learned absolute positions (fixes length extrapolation). RMSNorm replaces LayerNorm (cheaper, no mean subtraction). SwiGLU replaces the GELU FFN (better quality per parameter). Pre-norm replaces post-norm (training stability). No bias terms replaces biased linears (negligible cost, slight stability/throughput gain). Grouped-query attention replaces full MHA in later versions (smaller KV cache). Each swap targets a concrete efficiency or stability problem of the original Transformer.

Exercise 6Derive
Derive the KV-cache size; for 13B (40 layers, 40 heads, 128 dim), 16k context, batch 16, bf16, in GB.

Solution

Cache = 2 (K and V) × layers × heads × head_dim × seq_len × batch × bytes.

2 × 40 × 40 × 128 × 16384 × 16 × 2 bytes ≈ 2.15×10¹¹ ≈ 215 GB

The KV cache (≈215 GB here) dwarfs the model weights at long context and large batch — which is exactly why GQA/MQA (Exercise 8) and KV-cache compression matter so much for serving.

Exercise 7Pen & Paper
Why is generation memory-bandwidth-bound while training is compute-bound? Why does the KV cache bottleneck throughput?

Solution

Training processes many tokens in parallel with large matmuls — lots of compute per byte moved (high arithmetic intensity), so it is compute-bound. Autoregressive generation produces ONE token at a time, so each step does little compute but must read all the weights and the entire KV cache from memory — low arithmetic intensity, memory-bandwidth-bound. A large KV cache means more bytes to read per generated token, directly throttling generation throughput.

Exercise 8Pen & Paper
Compare MHA, GQA, MQA in KV-cache size and quality; cache reduction for 64 heads, 8 groups.

Solution

MHA has a separate K/V per query head (largest cache, best quality). MQA shares ONE K/V across all query heads (smallest cache, some quality loss). GQA groups query heads to share K/V per group — a middle ground. With 64 query heads and 8 groups, GQA stores K/V for 8 heads instead of 64, an 8× cache reduction, while retaining most of MHA's quality — why GQA is the modern default.

Exercise 9Pen & Paper
Why is FlashAttention faster with the same O(T²) FLOPs? What does it reduce?

Solution

FlashAttention computes identical attention but never materializes the full T×T score matrix in slow HBM. It tiles the computation, keeping blocks in fast on-chip SRAM and using the online-softmax trick to combine blocks. This slashes HBM memory TRAFFIC (the actual bottleneck, since attention is memory-bound), so it runs much faster and uses far less memory despite the same arithmetic — an IO-aware algorithm.

Exercise 10Pen & Paper
For sliding-window attention (window w, L layers), derive the effective receptive field; why can it still propagate long-range info?

Solution

Each layer lets a token attend w positions back, so after L stacked layers information can propagate up to ≈ w×L positions (each layer extends reach by w, like a CNN's growing receptive field). So even though any single layer is local, the stack achieves a long effective receptive field, letting distant tokens influence each other indirectly through intermediate positions — at far lower cost than full attention.

Exercise 11Code
Implement RoPE (cos/sin tables + rotation); verify attention score depends only on position difference.

Solution

Precompute cos/sin of position×frequency, rotate Q and K accordingly, and confirm that scores between positions m and n depend only on m−n (Exercise 2) — e.g. shifting both positions by a constant leaves the score unchanged. This validates the relative-position property in code.

Exercise 12Code
Implement ALiBi; compare attention distributions to no positional encoding.

Solution

Adding the per-head linear distance bias to scores produces attention that decays with distance (local focus), whereas no positional encoding gives position-agnostic attention. Comparing the distributions shows ALiBi's built-in recency/locality bias — the mechanism behind its extrapolation (Exercise 4).

Exercise 13Code
Implement RMSNorm and SwiGLU; verify the SwiGLU FFN matches a 4d GELU FFN at hidden=(8/3)d.

Solution

RMSNorm divides by the root-mean-square (no mean subtraction); SwiGLU gates one projection by the SiLU of another. Setting the SwiGLU hidden width to (8/3)d makes its three matrices total the same parameter count as the two-matrix 4d GELU FFN — confirming the parameter-match of Chapter 10's Exercise 8 in a built block.

Exercise 14Code Lab
Implement GQA with configurable groups; measure KV-cache size for MHA, GQA-8, MQA; confirm reductions.

Solution

Implementing grouped K/V projections and measuring cache size confirms MHA (full), GQA-8 (8× smaller for 64 heads), and MQA (heads× smaller) follow the expected ratios of Exercise 8 — a direct demonstration of the memory/quality dial.

Exercise 15Code
Implement the KV-cache; measure memory growth with length; switch to GQA and show the expected shrink.

Solution

Caching K/V during generation shows memory growing linearly with sequence length (Exercise 6). Switching the model to GQA reduces the cache by the group factor, visibly lowering the memory-vs-length curve — the practical payoff for long-context serving.

Exercise 16Code
Compare naive attention to F.scaled_dot_product_attention (FlashAttention) at T=4096; verify and measure.

Solution

The fused FlashAttention kernel produces identical outputs (to ~1e−5) but at T=4096 runs faster and uses dramatically less peak memory than the naive implementation that materializes the score matrix — confirming Exercise 9's IO-traffic argument empirically.

Exercise 17Code
Implement a sliding-window attention mask; verify the receptive field grows with stacked layers.

Solution

Masking attention to a local window and stacking layers on a synthetic propagation task shows information reaching tokens up to ≈ w×L away (Exercise 10) — demonstrating that local attention plus depth still captures long-range dependencies.

Exercise 18Code Lab
Build a top-1 MoE FFN over 4 experts; verify each token uses one expert; measure active vs total params.

Solution

A top-1 router sends each token to a single expert, so only one expert's parameters are active per token while all four exist in memory. Measuring active (≈¼ of total FFN) vs total parameters previews the MoE capacity/compute decoupling explored fully in Chapter 32.

Exercise 19Code
Assemble a modern decoder block (RMSNorm + RoPE + GQA + SwiGLU, no biases); compare param count to the Ch13 baseline.

Solution

Wiring the modern components into one block and verifying shape flow, then comparing parameter counts, shows the modern block achieves similar or better capacity with efficiency wins (smaller KV cache via GQA, cheaper norm) — the consolidated 'modern recipe' of Exercise 5.

Exercise 20Code (Challenge)
Modernize a GPT-2-style model (RoPE, RMSNorm, SwiGLU, GQA); train both and compare perplexity, memory, speed.

Solution

Swapping in the modern components and training alongside the baseline on the same data typically yields comparable or better perplexity with lower memory (RMSNorm, GQA cache) and faster generation — a hands-on demonstration of why the field migrated from the original Transformer to the LLaMA-style recipe.