Backpropagation in Depth
Detailed solutions for the exercises in Chapter 11. Try solving them yourself before checking the answers.
Solution
Nodes: a = x+y (local grads ∂a/∂x = ∂a/∂y = 1); b = x·y (∂b/∂x = y, ∂b/∂y = x); f = a·b (∂f/∂a = b, ∂f/∂b = a). Backprop multiplies and accumulates these local gradients along each path from f back to x and y.
Solution
Via the graph with a=5, b=6: ∂f/∂x = b·(∂a/∂x) + a·(∂b/∂x) = 6·1 + 5·y = 6 + 15 = 21. ∂f/∂y = 6·1 + 5·x = 6 + 10 = 16. Direct check: f = (x+y)(xy) = x²y + xy², so f_x = 2xy + y² = 12+9 = 21 and f_y = x² + 2xy = 4+12 = 16. ✓
Solution
When a variable feeds multiple downstream operations, the total derivative is the SUM of contributions along all paths (multivariate chain rule). For f = x + x, x feeds the add twice; the correct ∂f/∂x = 1 + 1 = 2. Using assignment (=) would overwrite the first path's gradient with the second, yielding 1 — wrong. Accumulation (+=) correctly sums both paths.
Solution
For Y = XW with X (n×d), W (d×m), Y (n×m), and upstream G = ∂L/∂Y (n×m): differentiating L through Y gives ∂L/∂X = G Wᵀ (shapes n×m · m×d = n×d, matching X) and ∂L/∂W = Xᵀ G (d×n · n×m = d×m, matching W). The transposes are forced by shape-matching, which is the quickest way to remember the matmul backward.
Solution
Using the Jacobian ∂p_i/∂z_j = p_i(δ_{ij} − p_j), the VJP is ∂L/∂z_j = Σ_i (∂L/∂p_i)·p_i(δ_{ij}−p_j) = p_j(∂L/∂p_j) − p_j·Σ_i(∂L/∂p_i)p_i. In vector form ∂L/∂z = p ⊙ (∂L/∂p − Σ_i(∂L/∂p_i)p_i). The subtracted scalar is the mean (under p) of the upstream gradient — the softmax 'centering' that also appears in the clean p−y form for cross-entropy.
Solution
The Jacobian of a 1000×1000→output map has on the order of (10⁶)×(10⁶) = 10¹² entries — enormous to store and form. The VJP computes (∂L/∂Y)Wᵀ directly as a matmul of the same size as the forward op (≈10⁹ FLOPs), never building the full Jacobian. Reverse-mode AD always propagates vector–Jacobian products, so it costs about the same as the forward pass and avoids the quadratic blowup.
Solution
Forward-mode propagates derivatives w.r.t. one input at a time, so getting all n=10⁹ parameter gradients costs ~n forward passes — hopeless. Reverse-mode propagates derivatives of the single scalar loss backward, getting ALL n gradients in one backward pass (cost ≈ one forward pass). With one output and billions of inputs, reverse mode is ~10⁹× cheaper — which is why all deep learning uses it.
Solution
With μ and σ² functions of all inputs, x̂ = (x−μ)/√(σ²+ε), the gradient must route through three paths (direct, through μ, through σ²). The result is ∂L/∂x_i = (1/√(σ²+ε))·[g_i − mean(g) − x̂_i·mean(g⊙x̂)], where g = ∂L/∂x̂ = upstream·γ. The two subtracted means are exactly the corrections for μ and σ² depending on every element — the standard fused LayerNorm-backward formula.
Solution
A node's gradient is only complete once ALL its downstream consumers have contributed. Reverse topological order guarantees every consumer is processed before the node itself. Counterexample: in a→b, a→c, b→d, c→d, if you compute ∂L/∂a before finishing both b and c, you'd use an incomplete accumulation and get the wrong gradient. Reverse-topo order ensures correctness.
Solution
Store activations only at every √L-th layer (√L checkpoints). During backward, recompute the √L activations within each segment on demand — needing memory for only one segment (√L layers) at a time, plus the √L checkpoints: O(√L) total instead of O(L). The cost is one extra forward pass over each segment, i.e. roughly one additional forward pass overall (≈33% more compute), trading compute for a quadratic memory saving.
Solution
Each operation records its inputs and a local backward closure that accumulates gradients (+=). For x^n the local grad is n·x^{n−1}; for −x it is −1; for tanh it is 1−tanh². A topological-sort backward then computes all gradients, which match finite differences to ~1e−5 — a working micro-autograd.
Solution
Building the MLP from scalar Value nodes and running gradient descent with the engine's backward drives XOR loss to ~0, proving the hand-built autograd can train a real (if tiny) network end to end.
Solution
Each tensor op stores a backward that produces gradients matching its inputs' shapes (summing over broadcasted axes for broadcast-add). Checking that every parameter's gradient has the same shape as the parameter confirms the central autograd invariant — the most common source of bugs when broadcasting is involved.
Solution
Implementing the VJPs derived in Exercises 5 and 8 and checking them against the Chapter-5 finite-difference checker to 1e−6 relative error validates the two trickiest backward passes in a Transformer — the ones most often gotten subtly wrong.
Solution
Running both your engine and PyTorch on identical data and weights yields gradients agreeing to floating-point tolerance, confirming the Tensor autograd is correct on a full multi-layer network — the conceptual foundation for trusting automatic differentiation in the rest of the book.
Solution
With assignment, the second use of x overwrites the first path's gradient, so f = x+x reports ∂f/∂x = 1 instead of 2. Switching to += fixes it — a concrete reproduction of the multi-path accumulation rule from Exercise 3.
Solution
Checkpointing reduces peak activation memory substantially (toward the O(√L) regime) at the cost of ~30% extra wall-clock time from recomputation — the empirical confirmation of Exercise 10's compute/memory trade-off.
Solution
Assembling the block from your autograd primitives and checking end-to-end gradients against PyTorch (to ~1e−5) demonstrates that the hand-built engine handles every component of a Transformer sublayer — the autograd you can conceptually rely on for the remainder of the book.