Solutions Appendix
Chapter 5

Numerical Methods & Stability

12 Solutions

Detailed solutions for the exercises in Chapter 5. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
float32 has 23 mantissa bits. Machine epsilon? Compare to float64.

Solution

Machine epsilon ≈ 2^−23 ≈ 1.19×10⁻⁷ for float32 (the smallest relative gap between representable numbers). float64 has 52 mantissa bits, giving ≈ 2^−52 ≈ 2.2×10⁻¹⁶ — about nine orders of magnitude more precise. This is why accumulations (sums, norms) are often done in float32/float64 even when storage is float16.

Exercise 2Pen & Paper
Show subtracting max from logits before softmax gives the same probabilities.

Solution

softmax(z)_i = e^{z_i}/Σ_j e^{z_j}. Subtracting c from every logit multiplies numerator and denominator by e^{−c}: e^{z_i−c}/Σ_j e^{z_j−c} = (e^{−c}e^{z_i})/(e^{−c}Σ e^{z_j}), and e^{−c} cancels. The probabilities are unchanged, but with c = max(z) the largest exponent becomes e^0 = 1, preventing overflow.

Exercise 3Pen & Paper
Prove logΣexp(z_i) = c + logΣexp(z_i−c). Why c=max for stability?

Solution

Σ e^{z_i} = e^c Σ e^{z_i−c}; taking logs gives logΣ e^{z_i} = c + logΣ e^{z_i−c}. Choosing c = max(z) makes the largest shifted exponent e^0 = 1 and all others ≤ 1, so no term overflows and the dominant term never underflows — the standard log-sum-exp trick.

Exercise 4Pen & Paper
A gradient is 5×10⁻⁹ (fp16 min 6×10⁻⁸). Does it underflow? What loss scale S fixes it?

Solution

Yes — 5×10⁻⁹ < 6×10⁻⁸, so it flushes to zero in fp16. To lift it above the minimum, scale the loss (and hence gradients) by S with S·5×10⁻⁹ ≥ 6×10⁻⁸, i.e. S ≥ 12. In practice a power of two such as S = 16 (or much larger, e.g. 1024, with dynamic loss scaling) is used; gradients are unscaled by S before the optimizer step. This is the core of mixed-precision training (Chapter 20).

Exercise 5Pen & Paper
Derive Welford's update M_n = M_{n−1} + (x_n−μ_{n−1})(x_n−μ_n); show it computes Σ(x_i−μ_n)².

Solution

Maintaining the running mean μ_n = μ_{n−1} + (x_n−μ_{n−1})/n, one can show the sum of squared deviations M_n = Σ_{i≤n}(x_i−μ_n)² satisfies M_n = M_{n−1} + (x_n−μ_{n−1})(x_n−μ_n). The cross terms from updating μ cancel exactly, so the recurrence accumulates the centered sum of squares in one pass, numerically stably — avoiding the catastrophic cancellation of the naive E[X²]−E[X]² formula (Exercise 10).

Exercise 6Pen & Paper
Central difference is O(h²), forward is O(h). For h=10⁻⁵, expected relative error of each?

Solution

Truncation error of forward difference ∝ h ≈ 10⁻⁵; central difference ∝ h² ≈ 10⁻¹⁰. Central differencing is far more accurate for the same step (though both are eventually limited by floating-point rounding, which grows as h shrinks — there is an optimal h balancing truncation and rounding).

Exercise 7Code
Implement naive and stable softmax; find the logit magnitude where naive first NaNs in fp32 and fp16.

Solution

Naive softmax overflows when e^{z} exceeds the float max: in fp32 (max ≈ 3.4×10³⁸) this happens around z ≈ 88; in fp16 (max ≈ 65504) around z ≈ 11. The stable version (subtract max first) never overflows at any magnitude, producing correct probabilities where the naive one returns NaN.

Exercise 8Code Lab 5.1
gradient_check a ReLU and a GELU layer; introduce a sign error in dReLU and confirm it's caught.

Solution

Finite-difference checking compares analytic gradients to (f(x+ε)−f(x−ε))/2ε. Correct ReLU (grad 1 for x>0, else 0) and GELU pass within ~1e−5. Flipping the sign of the ReLU derivative makes the analytic and numeric gradients disagree by ~2× the true value, which the check flags immediately — demonstrating gradient_check as a bug detector.

Exercise 9Code
Implement log_sum_exp; verify against scipy.special.logsumexp including edge cases.

Solution

Subtract the max before exponentiating: lse(z) = max(z) + logΣ e^{z−max(z)}. This matches scipy.special.logsumexp on random vectors and survives edge cases (all −inf → −inf; very large values → no overflow; very small → no underflow) where a naive log(Σ exp) fails.

Exercise 10Code
Demonstrate catastrophic cancellation: variance of [10⁸+1, 10⁸+2, 10⁸+3] in float32, naive vs two-pass.

Solution

The naive E[X²]−E[X]² subtracts two huge, nearly-equal float32 numbers (~10¹⁶), losing all significant digits — the result can be wildly wrong or even negative. The two-pass (or Welford) method subtracts the mean first, working with the small deviations [1,2,3], and recovers the true variance (≈0.667). This is the textbook example of catastrophic cancellation.

Exercise 11Code Lab 5.2
Run a 10-step training loop twice with the same seed; verify bit-identical, then enable nondeterminism and show divergence.

Solution

With a fixed seed and deterministic algorithms, the two runs produce bit-for-bit identical weights and losses. Enabling nondeterministic GPU kernels (e.g. atomic-add reductions whose order varies) makes the runs diverge in the low-order bits and then, through chaotic amplification, in visible digits — illustrating why reproducibility requires deterministic settings.

Exercise 12Code (Challenge)
Profile memory: train a 100M-param linear model for one batch in fp32 vs fp16 mixed precision; report the ratio.

Solution

Mixed precision stores activations and (often) a working copy of weights in fp16, roughly halving activation and gradient memory, though a master fp32 copy of weights and optimizer state remains. Measured with torch.cuda.max_memory_allocated(), peak memory typically drops to ~0.5–0.65× of fp32, with the exact ratio depending on how much of the footprint is activations (which halve) versus optimizer state (which may not).