Part I: Mathematical Foundations
Chapter 5

Numerical Methods & Stability

Floating point, numerics, and stable implementations
12 Exercises
5.1

Every number in a neural network is a finite approximation of a real number. The gap between the real number you intend and the floating-point number you get is small but it compounds across billions of operations. Understanding this is not pedantry — it is the difference between a training run that converges and one that silently diverges.

This chapter covers five classes of numerical problem that every ML engineer encounters:

5.2

Every float in your GPU is stored as three fields: a sign bit, an exponent, and a mantissa (significand). The exponent controls the magnitude (range); the mantissa controls the precision. This trade-off between range and precision is the root cause of most numerical problems in ML.

The Three Formats You’ll Use

bf16 vs fp16: Why bf16 Won for LLMs
Both bf16 and fp16 use 16 bits. The difference is how those bits are allocated. fp16 has a 5-bit exponent (max ~65504), which is easily overflowed by large attention logits or gradients. bf16 has the same 8-bit exponent as fp32 (max ~3.4×10³⁸), sacrificing mantissa precision instead.
For LLM training, range matters more than precision — gradients and activations span many orders of magnitude, and overflow to inf is catastrophic. bf16 almost never overflows and can be used without loss scaling, which is why it replaced fp16 for most large-scale training.
PythonFloating-point properties — inspecting what Python/PyTorch stores
import torch
import numpy as np

# ---- Machine epsilon: gap between 1.0 and next representable float ----
print(f"fp32 epsilon: {np.finfo(np.float32).eps:.2e}")
print(f"fp64 epsilon: {np.finfo(np.float64).eps:.2e}")
# fp32 epsilon: 1.19e-07   fp64 epsilon: 2.22e-16

# ---- Max and min values ----
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):>20}: max={info.max:.2e}  min_pos={info.tiny:.2e}  eps={info.eps:.2e}")
# torch.float32:  max=3.40e+38  min_pos=1.18e-38  eps=1.19e-07
# torch.float16:  max=6.55e+04  min_pos=6.10e-05  eps=9.77e-04
# torch.bfloat16: max=3.39e+38  min_pos=1.18e-38  eps=7.81e-03

# ---- Demonstrate rounding error ----
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 == 0.3 ? {a + b == np.float32(0.3)}")
print(f"0.1 + 0.2  =  {a + b:.20f}")
# False -- 0.30000001192092896000

# ---- fp16 overflow ----
x_fp16 = torch.tensor(65505.0, dtype=torch.float16)
print(f"65505 in fp16: {x_fp16}")
# inf -- overflowed!
x_bf16 = torch.tensor(65505.0, dtype=torch.bfloat16)
print(f"65505 in bf16: {x_bf16}")
# 65504.0 -- rounded but not inf
5.3

Overflow: Too Big for the Format

Overflow occurs when a computed value exceeds the maximum representable number. The result is ±inf. In neural networks, overflow most commonly occurs in: (1) the exponential in softmax for large logits, (2) fp16 gradients during backward passes, and (3) log-probability computations.

PythonOverflow in softmax — the motivating problem
import numpy as np

def naive_softmax(z):
    """WRONG: overflows for large logits."""
    return np.exp(z) / np.exp(z).sum()

# Small logits: fine
print(naive_softmax(np.array([1.0, 2.0, 3.0])))
# [0.09  0.245  0.665]

# Large logits: catastrophic overflow
big_logits = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(big_logits))
# [nan  nan  nan]  -- exp(1000) = inf, inf/inf = nan

# This happens in real training with large learning rates or bad init!
# fp16 overflows at exp(x) for x > ln(65504) ≈ 11.09
import torch
z_fp16 = torch.tensor([12.0, 1.0, 0.0], dtype=torch.float16)
print(torch.exp(z_fp16))
# tensor([inf, 2.7168, 1.], dtype=torch.float16) -- overflowed at 12!

Underflow: Too Small for the Format

Underflow occurs when a positive value rounds to zero because it is smaller than the minimum representable positive number. In ML, this most commonly corrupts log-probability computations: log(0.0) = -inf.

PythonUnderflow in probability computations
import numpy as np

# Joint probability of a long sequence underflows to 0.0
token_probs = [0.1] * 300  # each token has 10% probability
joint_prob  = np.prod(token_probs)
print(f"Joint prob (float64): {joint_prob}")
# 0.0  -- underflowed! True value is 10^{-300}

# Solution: work in log space
log_joint = np.sum(np.log(token_probs))
print(f"Log joint prob: {log_joint:.2f}")
# -690.78 -- accurate, no underflow

# Recover probability only when needed (and check it fits fp64)
if log_joint > np.log(np.finfo(float).tiny):
    print(f"Prob: {np.exp(log_joint)}")
else:
    print("Prob too small to represent in float64; stay in log space")

Catastrophic Cancellation: Losing Significant Digits

When two nearly-equal numbers are subtracted, the result has far fewer significant digits than either operand. The relative error can be enormous even though the absolute error is small.

PythonCatastrophic cancellation — the variance disaster
import numpy as np

# Computing variance as E[X^2] - E[X]^2  (one-pass, textbook formula)
def naive_variance(x):
    """DANGEROUS: catastrophic cancellation for large-mean data."""
    n = len(x)
    sum_x  = sum(x)
    sum_x2 = sum(xi**2 for xi in x)
    return (sum_x2 - sum_x**2/n) / n

def stable_variance(x):
    """Safe: subtract mean first, then compute variance."""
    mean = sum(x) / len(x)
    return sum((xi - mean)**2 for xi in x) / len(x)

# Small mean: both fine
small = [1.0, 2.0, 3.0, 4.0, 5.0]
print(f"Naive:  {naive_variance(small):.10f}")
print(f"Stable: {stable_variance(small):.10f}")
# Both give 2.0000000000 -- fine here

# Large mean: catastrophic cancellation in naive formula
big_mean = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]
print(f"\nNaive (big mean):  {naive_variance(big_mean):.10f}")
print(f"Stable (big mean): {stable_variance(big_mean):.10f}")
print(f"True variance:     {2.0:.10f}")
# Naive (big mean):  0.0000000000  WRONG! (should be 2.0)
# Stable (big mean): 2.0000000000  correct
⚠️
Layer Norm Requires Stable Variance
PyTorch’s LayerNorm uses Welford’s one-pass stable algorithm internally. If you ever implement layer normalisation by hand (e.g., in a custom CUDA kernel), you must use Welford’s or subtract the mean first. The naive E[X²]-E[X]² formula will silently produce wrong variances for typical activation magnitudes.
5.4

The log-sum-exp (LSE) trick solves the overflow problem in softmax and log-likelihood computations with a mathematically equivalent transformation that shifts all exponentials into a safe range. Every serious ML library uses this internally.

Derivation

We want to compute log(Σ exp(z_i)) without overflow. The trick: subtract the maximum logit c = max(z) before exponentiating. Since we’re inside a log, we can add it back:

logΣ exp(z_i) = c + logΣ exp(z_i - c) where c = max(z_i)

With this substitution, the largest argument to exp is exp(0) = 1. All other arguments are ≤ 0, so exp(⋅) ∈ (0, 1]. No overflow is possible.

PythonLog-sum-exp — derivation, implementation, and applications
import numpy as np

def log_sum_exp(z):
    """
    Numerically stable log(sum(exp(z))).
    Equivalent to scipy.special.logsumexp.
    """
    c = z.max()             # shift (safe subtraction)
    return c + np.log(np.sum(np.exp(z - c)))

# ---- Test on large logits that would overflow naively ----
z = np.array([1000.0, 1001.0, 1002.0])
print(f"Naive:  {np.log(np.sum(np.exp(z)))}")
# inf -- exp(1000) overflowed
print(f"Stable: {log_sum_exp(z):.4f}")
# 1002.4076  -- correct!

# ---- Verify against scipy ----
from scipy.special import logsumexp
print(np.isclose(log_sum_exp(z), logsumexp(z)))
# True

# ---- Application 1: stable softmax ----
def stable_softmax(z):
    z = z - z.max()        # shift by max (does not change probabilities)
    e = np.exp(z)
    return e / e.sum()

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))
# [0.0900  0.2447  0.6652]  -- correct, no overflow

# ---- Application 2: stable log-softmax ----
def stable_log_softmax(z):
    """log(softmax(z)) without computing softmax(z) first."""
    return z - log_sum_exp(z)

z = np.array([2.0, 1.0, 0.5])
log_p1 = stable_log_softmax(z)
log_p2 = np.log(stable_softmax(z))
print(np.allclose(log_p1, log_p2))
# True -- identical results, but log_softmax is more numerically stable

# ---- Application 3: stable cross-entropy loss ----
def stable_cross_entropy(logits, target_class):
    """Cross-entropy without risk of overflow or log(0)."""
    return -(logits[target_class] - log_sum_exp(logits))

# This is exactly what torch.nn.CrossEntropyLoss does internally
import torch, torch.nn.functional as F
logits_t = torch.tensor([1000.0, 1001.0, 1002.0])
target   = 2
print(f"Our stable CE:   {stable_cross_entropy(logits_t.numpy(), target):.4f}")
print(f"PyTorch CE:      {F.cross_entropy(logits_t.unsqueeze(0), torch.tensor([target])):.4f}")
# Both: 0.4076  -- correct and stable

LSE in Log-Likelihood Computations

When computing the log-likelihood of a sequence or marginalising over latent variables, log-sum-exp appears naturally anywhere you need to compute log of a sum of exponentials:

PythonLSE for marginalisation and log-likelihoods
import numpy as np
from scipy.special import logsumexp

# Marginalising over K latent states
# log P(x) = log sum_k P(x|z=k) P(z=k)
# = log sum_k exp(log P(x|z=k) + log P(z=k))
# = logsumexp(log_likelihoods + log_priors)

K = 5  # number of latent states (e.g., mixture components)
log_p_z = np.log(np.array([0.3, 0.2, 0.2, 0.15, 0.15]))  # log prior
log_p_x_given_z = np.array([-2.1, -3.5, -1.8, -4.2, -2.9])  # log likelihood

# Marginal log P(x): sum over all latent states
log_p_x = logsumexp(log_p_x_given_z + log_p_z)
print(f"log P(x) = {log_p_x:.4f}")

# Posterior: P(z|x) via Bayes (in log space, then softmax)
log_unnorm_post = log_p_x_given_z + log_p_z
log_post        = log_unnorm_post - log_p_x  # normalise
posterior       = np.exp(log_post)
print(f"Posterior P(z|x): {np.round(posterior, 3)}")
print(f"Sum: {posterior.sum():.4f}")
# Sum: 1.0000  -- correctly normalised
5.5

Layer normalisation computes the mean and variance of activations in a single forward pass. Naïvely, variance is computed as E[X²] - E[X]², which suffers catastrophic cancellation (Section 5.3). Welford’s algorithm computes mean and variance in one pass without this problem.

The Algorithm

Welford’s key insight: instead of accumulating sum and sum-of-squares, accumulate the mean and a running sum of squared deviations from the current mean. This keeps all values close to each other in magnitude, avoiding cancellation.

For each new sample x_n: δ = x_n - μ_{n-1} μ_n = μ_{n-1} + δ/n M_n = M_{n-1} + δ(x_n - μ_n)
Variance = M_n / n (population) or M_n / (n-1) (sample, unbiased)
PythonWelford’s algorithm — implementation and validation
import numpy as np

class WelfordStats:
    """Online computation of mean and variance using Welford's algorithm."""
    def __init__(self):
        self.n, self.mean, self.M = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta      = x - self.mean
        self.mean += delta / self.n               # update mean
        delta2     = x - self.mean           # note: uses NEW mean
        self.M    += delta * delta2              # running sum of sq. deviations

    @property
    def variance(self): return self.M / self.n         # population

    @property
    def sample_variance(self): return self.M / (self.n - 1) # unbiased

# ---- Test on large-mean data (naive formula fails here) ----
data = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]

w = WelfordStats()
for xi in data: w.update(xi)

print(f"Welford mean:      {w.mean:.6f}")
print(f"Welford variance:  {w.variance:.6f}")
print(f"NumPy variance:    {np.var(data):.6f}")
print(f"True variance:     2.000000")
# Welford: 2.000000  NumPy: 2.000000  (both correct; naive formula gives 0)

# ---- Layer Norm using Welford-style stable computation ----
def layer_norm_stable(x, eps=1e-5):
    """
    Numerically stable layer normalisation.
    x: (*, D) tensor-like. Normalises over last dimension.
    """
    mu  = x.mean(axis=-1, keepdims=True)  # stable: subtract mean first
    var = ((x - mu)**2).mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

acts = np.random.randn(4, 768) * 1e3 + 1e6  # large-mean activations
normed = layer_norm_stable(acts)
print(f"Mean after LN:  {normed.mean(axis=-1).mean():.6f}")
print(f"Std  after LN:  {normed.std(axis=-1).mean():.6f}")
# Mean: ≈0.000000  Std: ≈1.000000
5.6

Even experienced engineers make mistakes when implementing custom backward passes. Gradient checking is the gold-standard test: compute gradients analytically (via backprop) and numerically (via finite differences), then compare. Any discrepancy reveals a bug.

Central Finite Differences

The central difference formula has O(h²) error (vs. O(h) for forward differences), making it far more accurate for a given step size h:

∂f/∂θ_i ≈ [f(θ + h·e_i) - f(θ - h·e_i)] / (2h) where h ≈ 10⁻⁵
PythonCode Lab 5.1 — Universal gradient checker
import numpy as np

def gradient_check(loss_fn, params, h=1e-5, rtol=1e-4, verbose=True):
    """
    Check analytical gradients against central finite differences.
    loss_fn: callable (params) -> (scalar_loss, list_of_analytic_grads)
    params:  list of numpy arrays (the parameter tensors)
    Returns: True if all gradients pass, False otherwise.
    """
    _, analytic_grads = loss_fn(params)
    all_passed = True

    for pi, (p, ag) in enumerate(zip(params, analytic_grads)):
        num_grad = np.zeros_like(p, dtype=np.float64)
        for idx in np.ndindex(p.shape):
            p_plus  = [q.copy() for q in params]
            p_minus = [q.copy() for q in params]
            p_plus[pi][idx]  += h
            p_minus[pi][idx] -= h
            L_p, _ = loss_fn(p_plus)
            L_m, _ = loss_fn(p_minus)
            num_grad[idx] = (L_p - L_m) / (2 * h)

        # Relative error: |analytic - numeric| / (|analytic| + |numeric| + eps)
        err = np.abs(ag - num_grad)
        rel_err = (err / (np.abs(ag) + np.abs(num_grad) + 1e-8)).max()
        ok  = rel_err < rtol
        all_passed = all_passed and ok
        if verbose:
            mark = '✔ PASS' if ok else '✘ FAIL'
            print(f"  Param {pi} {p.shape}: max_rel_err={rel_err:.2e}  {mark}")
    return all_passed

# ---- Example: attention score computation ----
def attention_loss(params):
    """Scaled dot-product attention loss for gradient checking."""
    Q, K, V = params
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (T, T)
    # Stable softmax
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores); attn /= attn.sum(axis=-1, keepdims=True)
    out  = attn @ V                      # (T, D)
    loss = float(out.mean()**2)           # scalar

    # Backward pass
    dout  = 2 * out.mean() * np.ones_like(out) / out.size
    dV    = attn.T @ dout
    dattn = dout @ V.T
    # Softmax backward: ds = diag(attn) - attn^T attn (vectorised)
    ds    = attn * (dattn - (dattn * attn).sum(axis=-1, keepdims=True))
    dQ    = ds @ K / np.sqrt(d)
    dK    = ds.T @ Q / np.sqrt(d)
    return loss, [dQ, dK, dV]

np.random.seed(0)
T, D = 4, 8
Q0 = np.random.randn(T, D)
K0 = np.random.randn(T, D)
V0 = np.random.randn(T, D)
print("Gradient check for attention:")
gradient_check(attention_loss, [Q0, K0, V0])
# ✔ PASS for all three  (Q, K, V)

When Gradient Checks Fail

A failing gradient check always indicates a bug. The most common causes:

Wrong sign in a gradient computation (e.g., forgetting to negate in cross-entropy)
Using the wrong cached values in the backward pass (stale activations from a previous forward)
Missing the chain rule at one node (multiplying by 1 instead of the local gradient)
Incorrect handling of broadcasting (gradient must be summed over broadcast dimensions)
Using non-differentiable operations (argmax, floor) where a smooth approximation was intended
5.7

Training large models in full fp32 wastes memory and compute. Mixed-precision training uses lower-precision formats (fp16 or bf16) for the forward pass and gradient computations, while maintaining a full-precision fp32 copy of the weights for the optimizer update. This halves memory usage and roughly doubles throughput on modern GPUs.

The Mixed-Precision Workflow

PythonMixed-precision training with torch.cuda.amp
import torch
import torch.nn as nn

# Setup
model = nn.Linear(768, 768).cuda()
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

# GradScaler: manages loss scaling automatically for fp16
# For bf16, scaler is not needed (no overflow risk)
scaler = torch.cuda.amp.GradScaler()  # fp16 training

# Simulated training step
x = torch.randn(32, 768).cuda()
y = torch.randn(32, 768).cuda()

# --- fp16 training loop ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.float16):
    pred = model(x)            # forward in fp16
    loss = nn.functional.mse_loss(pred, y)
scaler.scale(loss).backward()       # scaled backward
scaler.unscale_(optim)              # unscale before clip
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optim)                 # update if no inf/nan
scaler.update()                    # adjust scale factor

# --- bf16 training loop (simpler: no scaler needed) ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    pred = model(x)
    loss = nn.functional.mse_loss(pred, y)
loss.backward()  # no scaling needed for bf16
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optim.step()

Loss Scaling: Why it Exists and How It Works

In fp16, small gradient values (< 6×10⁻⁸) underflow to zero and are lost. Loss scaling multiplies the loss by a large constant S before backward, scaling all gradients up by S. After backward, divide by S before updating weights. This rescues small gradients from underflow without changing the optimisation.

PythonLoss scaling — manual implementation to understand it
import torch

def training_step_with_manual_scaling(model, x, y, optim, scale=65536.0):
    """Manual loss scaling to rescue fp16 underflowed gradients."""
    optim.zero_grad()

    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = ((model(x) - y)**2).mean()

    # Scale loss BEFORE backward (scales all gradients by S)
    (loss * scale).backward()

    # Check for overflow in scaled gradients
    has_inf = any(p.grad.isinf().any() or p.grad.isnan().any()
              for p in model.parameters() if p.grad is not None)

    if has_inf:
        print("Overflow detected — skipping update, halving scale")
        optim.zero_grad()
        return scale / 2  # reduce scale

    # Unscale before weight update
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= scale

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optim.step()
    return min(scale * 2, 65536.0)  # increase scale if no overflow
ML Connection: Dynamic Loss Scaling
PyTorch’s GradScaler implements dynamic loss scaling automatically. It starts with a large scale factor (default 65536) and doubles it every 2000 successful steps. When it detects overflow, it halves the scale and skips the weight update. This adaptive behaviour keeps the scale as large as possible (rescuing the most gradients from underflow) while avoiding overflow.
5.8

A training run is reproducible if running the same code with the same data produces the same result. Perfect reproducibility is often impossible in practice due to non-deterministic CUDA operations, but controllable seeds can get you 99% of the way there — which is essential for ablation studies and debugging.

Sources of Non-Determinism

PythonCode Lab 5.2 — Fully reproducible training setup
import os, random, numpy as np, torch

def set_seed(seed: int, deterministic: bool = True):
    """
    Set all random seeds for reproducible training.
    deterministic=True: fully reproducible but ~10-20% slower.
    """
    # 1. Python built-in random
    random.seed(seed)

    # 2. NumPy
    np.random.seed(seed)

    # 3. PyTorch CPU and CUDA
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # for multi-GPU

    # 4. cuDNN determinism (trades speed for reproducibility)
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark    = not deterministic

    # 5. PyTorch deterministic algorithms (catches more ops)
    if deterministic:
        torch.use_deterministic_algorithms(True, warn_only=True)

    # 6. Environment variable for any remaining CUDA sources
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

    print(f"All seeds set to {seed}. Deterministic: {deterministic}")

# Usage at the top of every training script
set_seed(42, deterministic=True)

# Verify: two runs with same seed produce identical outputs
set_seed(0)
out1 = torch.randn(3, 3)
set_seed(0)
out2 = torch.randn(3, 3)
print(torch.allclose(out1, out2))
# True -- identical

What to Log for Reproducibility

Even with seeds set, environment differences can cause variation. Always log:

Exact commit hash of the codebase (git rev-parse HEAD)
All hyperparameters (use a config file or argparse, logged to wandb/mlflow)
Library versions: torch, transformers, numpy, cuda (torch.__version__, torch.version.cuda)
Hardware: GPU model and count (nvidia-smi)
Random seed used
Dataset hash or version identifier
5.9

Numerical Stability Checklist

When you see NaN loss, inf gradients, or wildly wrong outputs, run through this checklist:

Quick Reference: Stable Implementations

Exercises

Exercises 1–6 are pen-and-paper; 7–12 require code.

Exercise 1: Pen & Paper
A float32 number is stored with 23 mantissa bits. What is the maximum relative rounding error (machine epsilon)? How does this compare to float64?
Exercise 2: Pen & Paper
Show algebraically that subtracting the max from logits before softmax produces the same probabilities. (Hint: the max cancels out.)
Exercise 3: Pen & Paper
Prove that logΣexp(z_i) = c + logΣexp(z_i-c) for any constant c. Why does c=max(z) give the best numerical stability?
Exercise 4: Pen & Paper
A gradient is 5×10⁻⁹. In fp16 (min positive: 6×10⁻⁸), does it underflow? What loss scale S is needed to bring it above the fp16 minimum?
Exercise 5: Pen & Paper
Derive Welford’s update equation: M_n = M_{n-1} + (x_n - μ_{n-1})(x_n - μ_n). Show it computes Σ(x_i - μ_n)².
Exercise 6: Pen & Paper
The central difference formula has O(h²) error while the forward difference has O(h). For h=10⁻⁵, what is the expected relative error of each?
Exercise 7: Code
Implement naive_softmax and stable_softmax. Find the exact logit magnitude at which naive_softmax first produces NaN in fp32 and fp16.
Exercise 8: Code Lab 5.1
Apply gradient_check to a custom ReLU layer and a GELU layer. Verify both pass. Then introduce a sign error in dReLU and confirm the check catches it.
Exercise 9: Code
Implement log_sum_exp from scratch. Verify it matches scipy.special.logsumexp on 10,000 random vectors including edge cases (all-inf, very large, very small).
Exercise 10: Code
Demonstrate catastrophic cancellation: compute the variance of [10^8+1, 10^8+2, 10^8+3] in float32 using (a) naive E[X^2]-E[X]^2, (b) stable two-pass formula. Report error vs. the true value.
Exercise 11: Code Lab 5.2
Run a 10-step training loop on a small MLP twice with the same seed. Verify outputs are bit-for-bit identical. Then enable non-deterministic algorithms and show they diverge.
Exercise 12: Code (Challenge)
Profile memory usage: train a 100M-parameter linear model for one batch in fp32 vs. fp16 mixed precision. Measure peak GPU memory with torch.cuda.max_memory_allocated(). Report the ratio.

Further reading: “What Every Computer Scientist Should Know About Floating-Point Arithmetic” (Goldberg, 1991) — the definitive reference, freely available online. “Numerical Optimization” (Nocedal & Wright) Chapter 8 for gradient checking theory. The PyTorch documentation on AMP (Automatic Mixed Precision) and GradScaler.

You have now covered the complete mathematical toolkit for understanding every design decision, loss function, and training recipe in modern LLMs. The five chapters of Part I connect as follows:

Part II begins with classical machine learning: discriminative models, generative approaches, word embeddings, and RNNs. Every concept builds directly on the foundations established here — and the Transformer in Part III will feel like a natural evolution rather than an unexplained black box.

12 Exercises in this chapter
Attempt each exercise before checking the worked solutions.
View Solutions →