Part I: Mathematical Foundations

Chapter 5

Numerical Methods & Stability

Floating point, numerics, and stable implementations

12 Exercises

Learning Objectives

1.	Understand IEEE 754 floating-point representation and why it leads to rounding errors.
2.	Identify and prevent overflow, underflow, and catastrophic cancellation in ML code.
3.	Implement the log-sum-exp trick and numerically stable softmax from first principles.
4.	Build a robust gradient checker using central finite differences.
5.	Understand mixed-precision training (fp16/bf16) and implement loss scaling correctly.
6.	Write a stable online variance algorithm (Welford’s method) as used in layer normalisation.
7.	Set up fully reproducible training runs and understand why perfect reproducibility is limited.

Every number in a neural network is a finite approximation of a real number. The gap between the real number you intend and the floating-point number you get is small but it compounds across billions of operations. Understanding this is not pedantry — it is the difference between a training run that converges and one that silently diverges.

This chapter covers five classes of numerical problem that every ML engineer encounters:

Problem	Symptom	Where in ML
Rounding error	Wrong decimal places	Accumulating weight updates over many steps
Overflow	inf in activations or gradients	Large logits, exploding gradients, exp(x) for big x
Underflow	0.0 where a small positive was expected	log(0) = -inf in cross-entropy; softmax denominator
Catastrophic cancellation	Significant digits lost in subtraction	Variance computation, residual connections at init
Numerical instability	NaN loss that appears suddenly	Incorrect softmax, unnormalised attention weights

Every float in your GPU is stored as three fields: a sign bit, an exponent, and a mantissa (significand). The exponent controls the magnitude (range); the mantissa controls the precision. This trade-off between range and precision is the root cause of most numerical problems in ML.

The Three Formats You’ll Use

Format	Total bits	Exponent	Mantissa	Max value	Min positive
fp64 (double)	64	11 bits	52 bits	~1.8×10³⁰⁸	~5×10⁻³²⁴
fp32 (float)	32	8 bits	23 bits	~3.4×10³⁸	~1.2×10⁻³⁸
bf16 (bfloat)	16	8 bits	7 bits	~3.4×10³⁸	~1.2×10⁻³⁸
fp16 (half)	16	5 bits	10 bits	65504	~6×10⁻⁸
fp8 (E4M3)	8	4 bits	3 bits	448	~1.5×10⁻⁶

✧

bf16 vs fp16: Why bf16 Won for LLMs

Both bf16 and fp16 use 16 bits. The difference is how those bits are allocated. fp16 has a 5-bit exponent (max ~65504), which is easily overflowed by large attention logits or gradients. bf16 has the same 8-bit exponent as fp32 (max ~3.4×10³⁸), sacrificing mantissa precision instead.

For LLM training, range matters more than precision — gradients and activations span many orders of magnitude, and overflow to inf is catastrophic. bf16 almost never overflows and can be used without loss scaling, which is why it replaced fp16 for most large-scale training.

Python•Floating-point properties — inspecting what Python/PyTorch stores
import torch
import numpy as np

# ---- Machine epsilon: gap between 1.0 and next representable float ----
print(f"fp32 epsilon: {np.finfo(np.float32).eps:.2e}")
print(f"fp64 epsilon: {np.finfo(np.float64).eps:.2e}")
# fp32 epsilon: 1.19e-07   fp64 epsilon: 2.22e-16

# ---- Max and min values ----
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
    info = torch.finfo(dtype)
    print(f"{str(dtype):>20}: max={info.max:.2e}  min_pos={info.tiny:.2e}  eps={info.eps:.2e}")
# torch.float32:  max=3.40e+38  min_pos=1.18e-38  eps=1.19e-07
# torch.float16:  max=6.55e+04  min_pos=6.10e-05  eps=9.77e-04
# torch.bfloat16: max=3.39e+38  min_pos=1.18e-38  eps=7.81e-03

# ---- Demonstrate rounding error ----
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 == 0.3 ? {a + b == np.float32(0.3)}")
print(f"0.1 + 0.2  =  {a + b:.20f}")
# False -- 0.30000001192092896000

# ---- fp16 overflow ----
x_fp16 = torch.tensor(65505.0, dtype=torch.float16)
print(f"65505 in fp16: {x_fp16}")
# inf -- overflowed!
x_bf16 = torch.tensor(65505.0, dtype=torch.bfloat16)
print(f"65505 in bf16: {x_bf16}")
# 65504.0 -- rounded but not inf

Overflow: Too Big for the Format

Overflow occurs when a computed value exceeds the maximum representable number. The result is ±inf. In neural networks, overflow most commonly occurs in: (1) the exponential in softmax for large logits, (2) fp16 gradients during backward passes, and (3) log-probability computations.

Python•Overflow in softmax — the motivating problem
import numpy as np

def naive_softmax(z):
    """WRONG: overflows for large logits."""
    return np.exp(z) / np.exp(z).sum()

# Small logits: fine
print(naive_softmax(np.array([1.0, 2.0, 3.0])))
# [0.09  0.245  0.665]

# Large logits: catastrophic overflow
big_logits = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(big_logits))
# [nan  nan  nan]  -- exp(1000) = inf, inf/inf = nan

# This happens in real training with large learning rates or bad init!
# fp16 overflows at exp(x) for x > ln(65504) ≈ 11.09
import torch
z_fp16 = torch.tensor([12.0, 1.0, 0.0], dtype=torch.float16)
print(torch.exp(z_fp16))
# tensor([inf, 2.7168, 1.], dtype=torch.float16) -- overflowed at 12!

Underflow: Too Small for the Format

Underflow occurs when a positive value rounds to zero because it is smaller than the minimum representable positive number. In ML, this most commonly corrupts log-probability computations: log(0.0) = -inf.

Python•Underflow in probability computations
import numpy as np

# Joint probability of a long sequence underflows to 0.0
token_probs = [0.1] * 300  # each token has 10% probability
joint_prob  = np.prod(token_probs)
print(f"Joint prob (float64): {joint_prob}")
# 0.0  -- underflowed! True value is 10^{-300}

# Solution: work in log space
log_joint = np.sum(np.log(token_probs))
print(f"Log joint prob: {log_joint:.2f}")
# -690.78 -- accurate, no underflow

# Recover probability only when needed (and check it fits fp64)
if log_joint > np.log(np.finfo(float).tiny):
    print(f"Prob: {np.exp(log_joint)}")
else:
    print("Prob too small to represent in float64; stay in log space")

Catastrophic Cancellation: Losing Significant Digits

When two nearly-equal numbers are subtracted, the result has far fewer significant digits than either operand. The relative error can be enormous even though the absolute error is small.

Python•Catastrophic cancellation — the variance disaster
import numpy as np

# Computing variance as E[X^2] - E[X]^2  (one-pass, textbook formula)
def naive_variance(x):
    """DANGEROUS: catastrophic cancellation for large-mean data."""
    n = len(x)
    sum_x  = sum(x)
    sum_x2 = sum(xi**2 for xi in x)
    return (sum_x2 - sum_x**2/n) / n

def stable_variance(x):
    """Safe: subtract mean first, then compute variance."""
    mean = sum(x) / len(x)
    return sum((xi - mean)**2 for xi in x) / len(x)

# Small mean: both fine
small = [1.0, 2.0, 3.0, 4.0, 5.0]
print(f"Naive:  {naive_variance(small):.10f}")
print(f"Stable: {stable_variance(small):.10f}")
# Both give 2.0000000000 -- fine here

# Large mean: catastrophic cancellation in naive formula
big_mean = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]
print(f"\nNaive (big mean):  {naive_variance(big_mean):.10f}")
print(f"Stable (big mean): {stable_variance(big_mean):.10f}")
print(f"True variance:     {2.0:.10f}")
# Naive (big mean):  0.0000000000  WRONG! (should be 2.0)
# Stable (big mean): 2.0000000000  correct

⚠️

Layer Norm Requires Stable Variance

PyTorch’s LayerNorm uses Welford’s one-pass stable algorithm internally. If you ever implement layer normalisation by hand (e.g., in a custom CUDA kernel), you must use Welford’s or subtract the mean first. The naive E[X²]-E[X]² formula will silently produce wrong variances for typical activation magnitudes.

The log-sum-exp (LSE) trick solves the overflow problem in softmax and log-likelihood computations with a mathematically equivalent transformation that shifts all exponentials into a safe range. Every serious ML library uses this internally.

Derivation

We want to compute log(Σ exp(z_i)) without overflow. The trick: subtract the maximum logit c = max(z) before exponentiating. Since we’re inside a log, we can add it back:

logΣ exp(z_i) = c + logΣ exp(z_i - c) where c = max(z_i)

With this substitution, the largest argument to exp is exp(0) = 1. All other arguments are ≤ 0, so exp(⋅) ∈ (0, 1]. No overflow is possible.

Python•Log-sum-exp — derivation, implementation, and applications
import numpy as np

def log_sum_exp(z):
    """
    Numerically stable log(sum(exp(z))).
    Equivalent to scipy.special.logsumexp.
    """
    c = z.max()             # shift (safe subtraction)
    return c + np.log(np.sum(np.exp(z - c)))

# ---- Test on large logits that would overflow naively ----
z = np.array([1000.0, 1001.0, 1002.0])
print(f"Naive:  {np.log(np.sum(np.exp(z)))}")
# inf -- exp(1000) overflowed
print(f"Stable: {log_sum_exp(z):.4f}")
# 1002.4076  -- correct!

# ---- Verify against scipy ----
from scipy.special import logsumexp
print(np.isclose(log_sum_exp(z), logsumexp(z)))
# True

# ---- Application 1: stable softmax ----
def stable_softmax(z):
    z = z - z.max()        # shift by max (does not change probabilities)
    e = np.exp(z)
    return e / e.sum()

print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))
# [0.0900  0.2447  0.6652]  -- correct, no overflow

# ---- Application 2: stable log-softmax ----
def stable_log_softmax(z):
    """log(softmax(z)) without computing softmax(z) first."""
    return z - log_sum_exp(z)

z = np.array([2.0, 1.0, 0.5])
log_p1 = stable_log_softmax(z)
log_p2 = np.log(stable_softmax(z))
print(np.allclose(log_p1, log_p2))
# True -- identical results, but log_softmax is more numerically stable

# ---- Application 3: stable cross-entropy loss ----
def stable_cross_entropy(logits, target_class):
    """Cross-entropy without risk of overflow or log(0)."""
    return -(logits[target_class] - log_sum_exp(logits))

# This is exactly what torch.nn.CrossEntropyLoss does internally
import torch, torch.nn.functional as F
logits_t = torch.tensor([1000.0, 1001.0, 1002.0])
target   = 2
print(f"Our stable CE:   {stable_cross_entropy(logits_t.numpy(), target):.4f}")
print(f"PyTorch CE:      {F.cross_entropy(logits_t.unsqueeze(0), torch.tensor([target])):.4f}")
# Both: 0.4076  -- correct and stable

LSE in Log-Likelihood Computations

When computing the log-likelihood of a sequence or marginalising over latent variables, log-sum-exp appears naturally anywhere you need to compute log of a sum of exponentials:

Python•LSE for marginalisation and log-likelihoods
import numpy as np
from scipy.special import logsumexp

# Marginalising over K latent states
# log P(x) = log sum_k P(x|z=k) P(z=k)
# = log sum_k exp(log P(x|z=k) + log P(z=k))
# = logsumexp(log_likelihoods + log_priors)

K = 5  # number of latent states (e.g., mixture components)
log_p_z = np.log(np.array([0.3, 0.2, 0.2, 0.15, 0.15]))  # log prior
log_p_x_given_z = np.array([-2.1, -3.5, -1.8, -4.2, -2.9])  # log likelihood

# Marginal log P(x): sum over all latent states
log_p_x = logsumexp(log_p_x_given_z + log_p_z)
print(f"log P(x) = {log_p_x:.4f}")

# Posterior: P(z|x) via Bayes (in log space, then softmax)
log_unnorm_post = log_p_x_given_z + log_p_z
log_post        = log_unnorm_post - log_p_x  # normalise
posterior       = np.exp(log_post)
print(f"Posterior P(z|x): {np.round(posterior, 3)}")
print(f"Sum: {posterior.sum():.4f}")
# Sum: 1.0000  -- correctly normalised

Layer normalisation computes the mean and variance of activations in a single forward pass. Naïvely, variance is computed as E[X²] - E[X]², which suffers catastrophic cancellation (Section 5.3). Welford’s algorithm computes mean and variance in one pass without this problem.

The Algorithm

Welford’s key insight: instead of accumulating sum and sum-of-squares, accumulate the mean and a running sum of squared deviations from the current mean. This keeps all values close to each other in magnitude, avoiding cancellation.

For each new sample x_n: δ = x_n - μ_{n-1} μ_n = μ_{n-1} + δ/n M_n = M_{n-1} + δ(x_n - μ_n)

Variance = M_n / n (population) or M_n / (n-1) (sample, unbiased)

Python•Welford’s algorithm — implementation and validation
import numpy as np

class WelfordStats:
    """Online computation of mean and variance using Welford's algorithm."""
    def __init__(self):
        self.n, self.mean, self.M = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta      = x - self.mean
        self.mean += delta / self.n               # update mean
        delta2     = x - self.mean           # note: uses NEW mean
        self.M    += delta * delta2              # running sum of sq. deviations

    @property
    def variance(self): return self.M / self.n         # population

    @property
    def sample_variance(self): return self.M / (self.n - 1) # unbiased

# ---- Test on large-mean data (naive formula fails here) ----
data = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]

w = WelfordStats()
for xi in data: w.update(xi)

print(f"Welford mean:      {w.mean:.6f}")
print(f"Welford variance:  {w.variance:.6f}")
print(f"NumPy variance:    {np.var(data):.6f}")
print(f"True variance:     2.000000")
# Welford: 2.000000  NumPy: 2.000000  (both correct; naive formula gives 0)

# ---- Layer Norm using Welford-style stable computation ----
def layer_norm_stable(x, eps=1e-5):
    """
    Numerically stable layer normalisation.
    x: (*, D) tensor-like. Normalises over last dimension.
    """
    mu  = x.mean(axis=-1, keepdims=True)  # stable: subtract mean first
    var = ((x - mu)**2).mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

acts = np.random.randn(4, 768) * 1e3 + 1e6  # large-mean activations
normed = layer_norm_stable(acts)
print(f"Mean after LN:  {normed.mean(axis=-1).mean():.6f}")
print(f"Std  after LN:  {normed.std(axis=-1).mean():.6f}")
# Mean: ≈0.000000  Std: ≈1.000000

Even experienced engineers make mistakes when implementing custom backward passes. Gradient checking is the gold-standard test: compute gradients analytically (via backprop) and numerically (via finite differences), then compare. Any discrepancy reveals a bug.

Central Finite Differences

The central difference formula has O(h²) error (vs. O(h) for forward differences), making it far more accurate for a given step size h:

∂f/∂θ_i ≈ [f(θ + h·e_i) - f(θ - h·e_i)] / (2h) where h ≈ 10⁻⁵

Python•Code Lab 5.1 — Universal gradient checker
import numpy as np

def gradient_check(loss_fn, params, h=1e-5, rtol=1e-4, verbose=True):
    """
    Check analytical gradients against central finite differences.
    loss_fn: callable (params) -> (scalar_loss, list_of_analytic_grads)
    params:  list of numpy arrays (the parameter tensors)
    Returns: True if all gradients pass, False otherwise.
    """
    _, analytic_grads = loss_fn(params)
    all_passed = True

    for pi, (p, ag) in enumerate(zip(params, analytic_grads)):
        num_grad = np.zeros_like(p, dtype=np.float64)
        for idx in np.ndindex(p.shape):
            p_plus  = [q.copy() for q in params]
            p_minus = [q.copy() for q in params]
            p_plus[pi][idx]  += h
            p_minus[pi][idx] -= h
            L_p, _ = loss_fn(p_plus)
            L_m, _ = loss_fn(p_minus)
            num_grad[idx] = (L_p - L_m) / (2 * h)

        # Relative error: |analytic - numeric| / (|analytic| + |numeric| + eps)
        err = np.abs(ag - num_grad)
        rel_err = (err / (np.abs(ag) + np.abs(num_grad) + 1e-8)).max()
        ok  = rel_err < rtol
        all_passed = all_passed and ok
        if verbose:
            mark = '✔ PASS' if ok else '✘ FAIL'
            print(f"  Param {pi} {p.shape}: max_rel_err={rel_err:.2e}  {mark}")
    return all_passed

# ---- Example: attention score computation ----
def attention_loss(params):
    """Scaled dot-product attention loss for gradient checking."""
    Q, K, V = params
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # (T, T)
    # Stable softmax
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores); attn /= attn.sum(axis=-1, keepdims=True)
    out  = attn @ V                      # (T, D)
    loss = float(out.mean()**2)           # scalar

    # Backward pass
    dout  = 2 * out.mean() * np.ones_like(out) / out.size
    dV    = attn.T @ dout
    dattn = dout @ V.T
    # Softmax backward: ds = diag(attn) - attn^T attn (vectorised)
    ds    = attn * (dattn - (dattn * attn).sum(axis=-1, keepdims=True))
    dQ    = ds @ K / np.sqrt(d)
    dK    = ds.T @ Q / np.sqrt(d)
    return loss, [dQ, dK, dV]

np.random.seed(0)
T, D = 4, 8
Q0 = np.random.randn(T, D)
K0 = np.random.randn(T, D)
V0 = np.random.randn(T, D)
print("Gradient check for attention:")
gradient_check(attention_loss, [Q0, K0, V0])
# ✔ PASS for all three  (Q, K, V)

When Gradient Checks Fail

A failing gradient check always indicates a bug. The most common causes:

•Wrong sign in a gradient computation (e.g., forgetting to negate in cross-entropy)

•Using the wrong cached values in the backward pass (stale activations from a previous forward)

•Missing the chain rule at one node (multiplying by 1 instead of the local gradient)

•Incorrect handling of broadcasting (gradient must be summed over broadcast dimensions)

•Using non-differentiable operations (argmax, floor) where a smooth approximation was intended

Training large models in full fp32 wastes memory and compute. Mixed-precision training uses lower-precision formats (fp16 or bf16) for the forward pass and gradient computations, while maintaining a full-precision fp32 copy of the weights for the optimizer update. This halves memory usage and roughly doubles throughput on modern GPUs.

The Mixed-Precision Workflow

Step	Operation	Precision & rationale
1	Store master weights	fp32 — exact weight updates require full precision
2	Cast to low precision	fp16 or bf16 — fast GPU tensor cores operate at 16-bit
3	Forward pass	fp16/bf16 — activations and intermediate values at half precision
4	Compute loss	fp32 — scalar accumulation avoids underflow in the loss
5	Scale loss (fp16 only)	fp32 — multiply by loss_scale to rescue small gradients
6	Backward pass	fp16/bf16 — gradients in half precision
7	Unscale gradients	fp32 — divide by loss_scale; check for inf/nan
8	Clip & update weights	fp32 — accumulate into master weights

Python•Mixed-precision training with torch.cuda.amp
import torch
import torch.nn as nn

# Setup
model = nn.Linear(768, 768).cuda()
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)

# GradScaler: manages loss scaling automatically for fp16
# For bf16, scaler is not needed (no overflow risk)
scaler = torch.cuda.amp.GradScaler()  # fp16 training

# Simulated training step
x = torch.randn(32, 768).cuda()
y = torch.randn(32, 768).cuda()

# --- fp16 training loop ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.float16):
    pred = model(x)            # forward in fp16
    loss = nn.functional.mse_loss(pred, y)
scaler.scale(loss).backward()       # scaled backward
scaler.unscale_(optim)              # unscale before clip
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optim)                 # update if no inf/nan
scaler.update()                    # adjust scale factor

# --- bf16 training loop (simpler: no scaler needed) ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    pred = model(x)
    loss = nn.functional.mse_loss(pred, y)
loss.backward()  # no scaling needed for bf16
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optim.step()

Loss Scaling: Why it Exists and How It Works

In fp16, small gradient values (< 6×10⁻⁸) underflow to zero and are lost. Loss scaling multiplies the loss by a large constant S before backward, scaling all gradients up by S. After backward, divide by S before updating weights. This rescues small gradients from underflow without changing the optimisation.

Python•Loss scaling — manual implementation to understand it
import torch

def training_step_with_manual_scaling(model, x, y, optim, scale=65536.0):
    """Manual loss scaling to rescue fp16 underflowed gradients."""
    optim.zero_grad()

    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = ((model(x) - y)**2).mean()

    # Scale loss BEFORE backward (scales all gradients by S)
    (loss * scale).backward()

    # Check for overflow in scaled gradients
    has_inf = any(p.grad.isinf().any() or p.grad.isnan().any()
              for p in model.parameters() if p.grad is not None)

    if has_inf:
        print("Overflow detected — skipping update, halving scale")
        optim.zero_grad()
        return scale / 2  # reduce scale

    # Unscale before weight update
    for p in model.parameters():
        if p.grad is not None:
            p.grad /= scale

    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optim.step()
    return min(scale * 2, 65536.0)  # increase scale if no overflow

▶

ML Connection: Dynamic Loss Scaling

PyTorch’s GradScaler implements dynamic loss scaling automatically. It starts with a large scale factor (default 65536) and doubles it every 2000 successful steps. When it detects overflow, it halves the scale and skips the weight update. This adaptive behaviour keeps the scale as large as possible (rescuing the most gradients from underflow) while avoiding overflow.

A training run is reproducible if running the same code with the same data produces the same result. Perfect reproducibility is often impossible in practice due to non-deterministic CUDA operations, but controllable seeds can get you 99% of the way there — which is essential for ablation studies and debugging.

Sources of Non-Determinism

Source	Why it varies	How to control
Python random	Different sequences across runs	random.seed(seed)
NumPy random	Different arrays across runs	np.random.seed(seed)
PyTorch CPU	Different tensor samples	torch.manual_seed(seed)
PyTorch CUDA	GPU non-deterministic ops	torch.cuda.manual_seed_all(seed)
cuDNN algorithms	Multiple valid implementations chosen at runtime	torch.backends.cudnn.deterministic=True
Data loading order	Shuffled differently each epoch	DataLoader with fixed generator seed
Floating-point ordering	Different thread interleaving in reductions	torch.use_deterministic_algorithms(True)

Python•Code Lab 5.2 — Fully reproducible training setup
import os, random, numpy as np, torch

def set_seed(seed: int, deterministic: bool = True):
    """
    Set all random seeds for reproducible training.
    deterministic=True: fully reproducible but ~10-20% slower.
    """
    # 1. Python built-in random
    random.seed(seed)

    # 2. NumPy
    np.random.seed(seed)

    # 3. PyTorch CPU and CUDA
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # for multi-GPU

    # 4. cuDNN determinism (trades speed for reproducibility)
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark    = not deterministic

    # 5. PyTorch deterministic algorithms (catches more ops)
    if deterministic:
        torch.use_deterministic_algorithms(True, warn_only=True)

    # 6. Environment variable for any remaining CUDA sources
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

    print(f"All seeds set to {seed}. Deterministic: {deterministic}")

# Usage at the top of every training script
set_seed(42, deterministic=True)

# Verify: two runs with same seed produce identical outputs
set_seed(0)
out1 = torch.randn(3, 3)
set_seed(0)
out2 = torch.randn(3, 3)
print(torch.allclose(out1, out2))
# True -- identical

What to Log for Reproducibility

Even with seeds set, environment differences can cause variation. Always log:

•Exact commit hash of the codebase (git rev-parse HEAD)

•All hyperparameters (use a config file or argparse, logged to wandb/mlflow)

•Library versions: torch, transformers, numpy, cuda (torch.__version__, torch.version.cuda)

•Hardware: GPU model and count (nvidia-smi)

•Random seed used

•Dataset hash or version identifier

Numerical Stability Checklist

When you see NaN loss, inf gradients, or wildly wrong outputs, run through this checklist:

✔	Check	How to diagnose
1	Is the loss NaN from the very first step?	Check for log(0): add epsilon. Check for exp overflow: inspect logit magnitudes.
2	Does loss start fine then diverge?	Likely gradient explosion. Add gradient clipping. Check learning rate.
3	Are any weights inf or nan?	torch.isnan(param).any() after each step.
4	Is fp16 overflowing?	Switch to bf16 or lower the loss_scale. Check if scaler skips steps.
5	Is layer norm producing nan?	Activations may be all-zero (var=0). Add eps. Check for dead ReLU blocks.
6	Are gradients zero for some layers?	Gradient vanishing. Check activation functions, residual connections.
7	Does the model output the same thing regardless of input?	Model may be stuck. Check weight initialisation and learning rate.

Quick Reference: Stable Implementations

Operation	Unstable form	Stable form
Softmax	exp(z)/sum(exp(z))	exp(z-max)/sum(exp(z-max))
Log-softmax	log(softmax(z))	z - logsumexp(z)
Cross-entropy	-log(softmax(z)[y])	-(z[y] - logsumexp(z))
Log of sum	log(sum(exp(z)))	max(z) + log(sum(exp(z-max(z))))
Variance	E[X^2]-E[X]^2	Welford / (x-mean)^2.mean()
Layer norm	(x-E[X])/sqrt(E[X^2]-E[X]^2)	(x-mean)/sqrt(var(x-mean)+eps)
Log probability sum	log(prod(p))	sum(log(p))
Joint probability	prod(P(xi\|x<i))	exp(sum(log P(xi\|x<i)))

Exercises

Exercises 1–6 are pen-and-paper; 7–12 require code.

✎

Exercise 1: Pen & Paper

A float32 number is stored with 23 mantissa bits. What is the maximum relative rounding error (machine epsilon)? How does this compare to float64?

✎

Exercise 2: Pen & Paper

Show algebraically that subtracting the max from logits before softmax produces the same probabilities. (Hint: the max cancels out.)

✎

Exercise 3: Pen & Paper

Prove that logΣexp(z_i) = c + logΣexp(z_i-c) for any constant c. Why does c=max(z) give the best numerical stability?

✎

Exercise 4: Pen & Paper

A gradient is 5×10⁻⁹. In fp16 (min positive: 6×10⁻⁸), does it underflow? What loss scale S is needed to bring it above the fp16 minimum?

✎

Exercise 5: Pen & Paper

Derive Welford’s update equation: M_n = M_{n-1} + (x_n - μ_{n-1})(x_n - μ_n). Show it computes Σ(x_i - μ_n)².

✎

Exercise 6: Pen & Paper

The central difference formula has O(h²) error while the forward difference has O(h). For h=10⁻⁵, what is the expected relative error of each?

✎

Exercise 7: Code

Implement naive_softmax and stable_softmax. Find the exact logit magnitude at which naive_softmax first produces NaN in fp32 and fp16.

✎

Exercise 8: Code Lab 5.1

Apply gradient_check to a custom ReLU layer and a GELU layer. Verify both pass. Then introduce a sign error in dReLU and confirm the check catches it.

✎

Exercise 9: Code

Implement log_sum_exp from scratch. Verify it matches scipy.special.logsumexp on 10,000 random vectors including edge cases (all-inf, very large, very small).

✎

Exercise 10: Code

Demonstrate catastrophic cancellation: compute the variance of [10^8+1, 10^8+2, 10^8+3] in float32 using (a) naive E[X^2]-E[X]^2, (b) stable two-pass formula. Report error vs. the true value.

✎

Exercise 11: Code Lab 5.2

Run a 10-step training loop on a small MLP twice with the same seed. Verify outputs are bit-for-bit identical. Then enable non-deterministic algorithms and show they diverge.

✎

Exercise 12: Code (Challenge)

Profile memory usage: train a 100M-parameter linear model for one batch in fp32 vs. fp16 mixed precision. Measure peak GPU memory with torch.cuda.max_memory_allocated(). Report the ratio.

Further reading: “What Every Computer Scientist Should Know About Floating-Point Arithmetic” (Goldberg, 1991) — the definitive reference, freely available online. “Numerical Optimization” (Nocedal & Wright) Chapter 8 for gradient checking theory. The PyTorch documentation on AMP (Automatic Mixed Precision) and GradScaler.

You have now covered the complete mathematical toolkit for understanding every design decision, loss function, and training recipe in modern LLMs. The five chapters of Part I connect as follows:

Part II begins with classical machine learning: discriminative models, generative approaches, word embeddings, and RNNs. Every concept builds directly on the foundations established here — and the Transformer in Part III will feel like a natural evolution rather than an unexplained black box.

✎ 12 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 4. Information Theory

NextCh 6. Discriminative Models

→

Numerical Methods & Stability

Learning Objectives

Why Numbers Bite in ML

IEEE 754 Floating-Point Arithmetic

The Three Formats You’ll Use

Overflow, Underflow & Catastrophic Cancellation

Overflow: Too Big for the Format

Underflow: Too Small for the Format

Catastrophic Cancellation: Losing Significant Digits

The Log-Sum-Exp Trick

Derivation

LSE in Log-Likelihood Computations

Welford’s Online Algorithm

The Algorithm

Numerical Gradient Checking

Central Finite Differences

When Gradient Checks Fail

Mixed-Precision Training

The Mixed-Precision Workflow

Loss Scaling: Why it Exists and How It Works

Reproducibility & Random Seeds

Sources of Non-Determinism

What to Log for Reproducibility

Chapter Summary & Diagnostic Reference

Numerical Stability Checklist

Quick Reference: Stable Implementations

Exercises