Solutions Appendix
Chapter 11

Backpropagation in Depth

18 Solutions

Detailed solutions for the exercises in Chapter 11. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Draw the graph for f(x,y)=(x+y)·(x·y); label nodes and local gradients.

Solution

Nodes: a = x+y (local grads ∂a/∂x = ∂a/∂y = 1); b = x·y (∂b/∂x = y, ∂b/∂y = x); f = a·b (∂f/∂a = b, ∂f/∂b = a). Backprop multiplies and accumulates these local gradients along each path from f back to x and y.

Exercise 2Derive
Compute ∂f/∂x, ∂f/∂y at x=2,y=3 for f=(x+y)(xy); verify directly.

Solution

Via the graph with a=5, b=6: ∂f/∂x = b·(∂a/∂x) + a·(∂b/∂x) = 6·1 + 5·y = 6 + 15 = 21. ∂f/∂y = 6·1 + 5·x = 6 + 10 = 16. Direct check: f = (x+y)(xy) = x²y + xy², so f_x = 2xy + y² = 12+9 = 21 and f_y = x² + 2xy = 4+12 = 16. ✓

Exercise 3Pen & Paper
Why does backprop use gradient accumulation (+=)? Example where = fails.

Solution

When a variable feeds multiple downstream operations, the total derivative is the SUM of contributions along all paths (multivariate chain rule). For f = x + x, x feeds the add twice; the correct ∂f/∂x = 1 + 1 = 2. Using assignment (=) would overwrite the first path's gradient with the second, yielding 1 — wrong. Accumulation (+=) correctly sums both paths.

Exercise 4Derive
Derive the matmul VJP: ∂L/∂X = (∂L/∂Y)Wᵀ and ∂L/∂W = Xᵀ(∂L/∂Y).

Solution

For Y = XW with X (n×d), W (d×m), Y (n×m), and upstream G = ∂L/∂Y (n×m): differentiating L through Y gives ∂L/∂X = G Wᵀ (shapes n×m · m×d = n×d, matching X) and ∂L/∂W = Xᵀ G (d×n · n×m = d×m, matching W). The transposes are forced by shape-matching, which is the quickest way to remember the matmul backward.

Exercise 5Derive
Derive the softmax VJP ∂L/∂z = p⊙(∂L/∂p − Σ(∂L/∂p⊙p)).

Solution

Using the Jacobian ∂p_i/∂z_j = p_i(δ_{ij} − p_j), the VJP is ∂L/∂z_j = Σ_i (∂L/∂p_i)·p_i(δ_{ij}−p_j) = p_j(∂L/∂p_j) − p_j·Σ_i(∂L/∂p_i)p_i. In vector form ∂L/∂z = p ⊙ (∂L/∂p − Σ_i(∂L/∂p_i)p_i). The subtracted scalar is the mean (under p) of the upstream gradient — the softmax 'centering' that also appears in the clean p−y form for cross-entropy.

Exercise 6Pen & Paper
Cost of a full Jacobian for a 1000×1000 matmul vs its VJP. Why never materialize the Jacobian?

Solution

The Jacobian of a 1000×1000→output map has on the order of (10⁶)×(10⁶) = 10¹² entries — enormous to store and form. The VJP computes (∂L/∂Y)Wᵀ directly as a matmul of the same size as the forward op (≈10⁹ FLOPs), never building the full Jacobian. Reverse-mode AD always propagates vector–Jacobian products, so it costs about the same as the forward pass and avoids the quadratic blowup.

Exercise 7Pen & Paper
Forward- vs reverse-mode AD cost for n=10⁹ params and one scalar loss.

Solution

Forward-mode propagates derivatives w.r.t. one input at a time, so getting all n=10⁹ parameter gradients costs ~n forward passes — hopeless. Reverse-mode propagates derivatives of the single scalar loss backward, getting ALL n gradients in one backward pass (cost ≈ one forward pass). With one output and billions of inputs, reverse mode is ~10⁹× cheaper — which is why all deep learning uses it.

Exercise 8Derive
Derive the LayerNorm backward, accounting for μ and σ² depending on every x element.

Solution

With μ and σ² functions of all inputs, x̂ = (x−μ)/√(σ²+ε), the gradient must route through three paths (direct, through μ, through σ²). The result is ∂L/∂x_i = (1/√(σ²+ε))·[g_i − mean(g) − x̂_i·mean(g⊙x̂)], where g = ∂L/∂x̂ = upstream·γ. The two subtracted means are exactly the corrections for μ and σ² depending on every element — the standard fused LayerNorm-backward formula.

Exercise 9Pen & Paper
Why visit nodes in reverse topological order? Small graph where wrong order fails.

Solution

A node's gradient is only complete once ALL its downstream consumers have contributed. Reverse topological order guarantees every consumer is processed before the node itself. Counterexample: in a→b, a→c, b→d, c→d, if you compute ∂L/∂a before finishing both b and c, you'd use an incomplete accumulation and get the wrong gradient. Reverse-topo order ensures correctness.

Exercise 10Pen & Paper
Why does checkpointing every √L layers give O(√L) memory? Compute overhead?

Solution

Store activations only at every √L-th layer (√L checkpoints). During backward, recompute the √L activations within each segment on demand — needing memory for only one segment (√L layers) at a time, plus the √L checkpoints: O(√L) total instead of O(L). The cost is one extra forward pass over each segment, i.e. roughly one additional forward pass overall (≈33% more compute), trading compute for a quadratic memory saving.

Exercise 11Code
Implement the scalar Value autograd engine; add __pow__, __neg__, tanh; verify vs finite diff.

Solution

Each operation records its inputs and a local backward closure that accumulates gradients (+=). For x^n the local grad is n·x^{n−1}; for −x it is −1; for tanh it is 1−tanh². A topological-sort backward then computes all gradients, which match finite differences to ~1e−5 — a working micro-autograd.

Exercise 12Code
Use the Value engine to train a 2→4→1 MLP on XOR to near-zero loss.

Solution

Building the MLP from scalar Value nodes and running gradient descent with the engine's backward drives XOR loss to ~0, proving the hand-built autograd can train a real (if tiny) network end to end.

Exercise 13Code
Implement a Tensor autograd with matmul, broadcast-add, ReLU, sum; verify grad-shape invariant.

Solution

Each tensor op stores a backward that produces gradients matching its inputs' shapes (summing over broadcasted axes for broadcast-add). Checking that every parameter's gradient has the same shape as the parameter confirms the central autograd invariant — the most common source of bugs when broadcasting is involved.

Exercise 14Code Lab
Implement softmax_backward and layernorm_backward; verify both to 1e-6 vs finite diff.

Solution

Implementing the VJPs derived in Exercises 5 and 8 and checking them against the Chapter-5 finite-difference checker to 1e−6 relative error validates the two trickiest backward passes in a Transformer — the ones most often gotten subtly wrong.

Exercise 15Code
Train a 3-layer MLP with your Tensor autograd; compare gradients to PyTorch.

Solution

Running both your engine and PyTorch on identical data and weights yields gradients agreeing to floating-point tolerance, confirming the Tensor autograd is correct on a full multi-layer network — the conceptual foundation for trusting automatic differentiation in the rest of the book.

Exercise 16Code
Demonstrate the accumulation bug: add using = instead of +=; show wrong grads for f=x+x.

Solution

With assignment, the second use of x overwrites the first path's gradient, so f = x+x reports ∂f/∂x = 1 instead of 2. Switching to += fixes it — a concrete reproduction of the multi-path accumulation rule from Exercise 3.

Exercise 17Code Lab
Measure activation memory: 20-layer MLP with/without gradient checkpointing.

Solution

Checkpointing reduces peak activation memory substantially (toward the O(√L) regime) at the cost of ~30% extra wall-clock time from recomputation — the empirical confirmation of Exercise 10's compute/memory trade-off.

Exercise 18Code (Challenge)
Extend the Tensor autograd to a full FFN block (Linear→GELU→Linear, LayerNorm, residual); verify vs PyTorch.

Solution

Assembling the block from your autograd primitives and checking end-to-end gradients against PyTorch (to ~1e−5) demonstrates that the hand-built engine handles every component of a Transformer sublayer — the autograd you can conceptually rely on for the remainder of the book.