Numerical Methods & Stability
Every number in a neural network is a finite approximation of a real number. The gap between the real number you intend and the floating-point number you get is small but it compounds across billions of operations. Understanding this is not pedantry — it is the difference between a training run that converges and one that silently diverges.
This chapter covers five classes of numerical problem that every ML engineer encounters:
Every float in your GPU is stored as three fields: a sign bit, an exponent, and a mantissa (significand). The exponent controls the magnitude (range); the mantissa controls the precision. This trade-off between range and precision is the root cause of most numerical problems in ML.
The Three Formats You’ll Use
import torch
import numpy as np
# ---- Machine epsilon: gap between 1.0 and next representable float ----
print(f"fp32 epsilon: {np.finfo(np.float32).eps:.2e}")
print(f"fp64 epsilon: {np.finfo(np.float64).eps:.2e}")
# fp32 epsilon: 1.19e-07 fp64 epsilon: 2.22e-16
# ---- Max and min values ----
for dtype in [torch.float32, torch.float16, torch.bfloat16]:
info = torch.finfo(dtype)
print(f"{str(dtype):>20}: max={info.max:.2e} min_pos={info.tiny:.2e} eps={info.eps:.2e}")
# torch.float32: max=3.40e+38 min_pos=1.18e-38 eps=1.19e-07
# torch.float16: max=6.55e+04 min_pos=6.10e-05 eps=9.77e-04
# torch.bfloat16: max=3.39e+38 min_pos=1.18e-38 eps=7.81e-03
# ---- Demonstrate rounding error ----
a = np.float32(0.1)
b = np.float32(0.2)
print(f"0.1 + 0.2 == 0.3 ? {a + b == np.float32(0.3)}")
print(f"0.1 + 0.2 = {a + b:.20f}")
# False -- 0.30000001192092896000
# ---- fp16 overflow ----
x_fp16 = torch.tensor(65505.0, dtype=torch.float16)
print(f"65505 in fp16: {x_fp16}")
# inf -- overflowed!
x_bf16 = torch.tensor(65505.0, dtype=torch.bfloat16)
print(f"65505 in bf16: {x_bf16}")
# 65504.0 -- rounded but not infOverflow: Too Big for the Format
Overflow occurs when a computed value exceeds the maximum representable number. The result is ±inf. In neural networks, overflow most commonly occurs in: (1) the exponential in softmax for large logits, (2) fp16 gradients during backward passes, and (3) log-probability computations.
import numpy as np
def naive_softmax(z):
"""WRONG: overflows for large logits."""
return np.exp(z) / np.exp(z).sum()
# Small logits: fine
print(naive_softmax(np.array([1.0, 2.0, 3.0])))
# [0.09 0.245 0.665]
# Large logits: catastrophic overflow
big_logits = np.array([1000.0, 1001.0, 1002.0])
print(naive_softmax(big_logits))
# [nan nan nan] -- exp(1000) = inf, inf/inf = nan
# This happens in real training with large learning rates or bad init!
# fp16 overflows at exp(x) for x > ln(65504) ≈ 11.09
import torch
z_fp16 = torch.tensor([12.0, 1.0, 0.0], dtype=torch.float16)
print(torch.exp(z_fp16))
# tensor([inf, 2.7168, 1.], dtype=torch.float16) -- overflowed at 12!Underflow: Too Small for the Format
Underflow occurs when a positive value rounds to zero because it is smaller than the minimum representable positive number. In ML, this most commonly corrupts log-probability computations: log(0.0) = -inf.
import numpy as np
# Joint probability of a long sequence underflows to 0.0
token_probs = [0.1] * 300 # each token has 10% probability
joint_prob = np.prod(token_probs)
print(f"Joint prob (float64): {joint_prob}")
# 0.0 -- underflowed! True value is 10^{-300}
# Solution: work in log space
log_joint = np.sum(np.log(token_probs))
print(f"Log joint prob: {log_joint:.2f}")
# -690.78 -- accurate, no underflow
# Recover probability only when needed (and check it fits fp64)
if log_joint > np.log(np.finfo(float).tiny):
print(f"Prob: {np.exp(log_joint)}")
else:
print("Prob too small to represent in float64; stay in log space")Catastrophic Cancellation: Losing Significant Digits
When two nearly-equal numbers are subtracted, the result has far fewer significant digits than either operand. The relative error can be enormous even though the absolute error is small.
import numpy as np
# Computing variance as E[X^2] - E[X]^2 (one-pass, textbook formula)
def naive_variance(x):
"""DANGEROUS: catastrophic cancellation for large-mean data."""
n = len(x)
sum_x = sum(x)
sum_x2 = sum(xi**2 for xi in x)
return (sum_x2 - sum_x**2/n) / n
def stable_variance(x):
"""Safe: subtract mean first, then compute variance."""
mean = sum(x) / len(x)
return sum((xi - mean)**2 for xi in x) / len(x)
# Small mean: both fine
small = [1.0, 2.0, 3.0, 4.0, 5.0]
print(f"Naive: {naive_variance(small):.10f}")
print(f"Stable: {stable_variance(small):.10f}")
# Both give 2.0000000000 -- fine here
# Large mean: catastrophic cancellation in naive formula
big_mean = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]
print(f"\nNaive (big mean): {naive_variance(big_mean):.10f}")
print(f"Stable (big mean): {stable_variance(big_mean):.10f}")
print(f"True variance: {2.0:.10f}")
# Naive (big mean): 0.0000000000 WRONG! (should be 2.0)
# Stable (big mean): 2.0000000000 correctThe log-sum-exp (LSE) trick solves the overflow problem in softmax and log-likelihood computations with a mathematically equivalent transformation that shifts all exponentials into a safe range. Every serious ML library uses this internally.
Derivation
We want to compute log(Σ exp(z_i)) without overflow. The trick: subtract the maximum logit c = max(z) before exponentiating. Since we’re inside a log, we can add it back:
With this substitution, the largest argument to exp is exp(0) = 1. All other arguments are ≤ 0, so exp(⋅) ∈ (0, 1]. No overflow is possible.
import numpy as np
def log_sum_exp(z):
"""
Numerically stable log(sum(exp(z))).
Equivalent to scipy.special.logsumexp.
"""
c = z.max() # shift (safe subtraction)
return c + np.log(np.sum(np.exp(z - c)))
# ---- Test on large logits that would overflow naively ----
z = np.array([1000.0, 1001.0, 1002.0])
print(f"Naive: {np.log(np.sum(np.exp(z)))}")
# inf -- exp(1000) overflowed
print(f"Stable: {log_sum_exp(z):.4f}")
# 1002.4076 -- correct!
# ---- Verify against scipy ----
from scipy.special import logsumexp
print(np.isclose(log_sum_exp(z), logsumexp(z)))
# True
# ---- Application 1: stable softmax ----
def stable_softmax(z):
z = z - z.max() # shift by max (does not change probabilities)
e = np.exp(z)
return e / e.sum()
print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))
# [0.0900 0.2447 0.6652] -- correct, no overflow
# ---- Application 2: stable log-softmax ----
def stable_log_softmax(z):
"""log(softmax(z)) without computing softmax(z) first."""
return z - log_sum_exp(z)
z = np.array([2.0, 1.0, 0.5])
log_p1 = stable_log_softmax(z)
log_p2 = np.log(stable_softmax(z))
print(np.allclose(log_p1, log_p2))
# True -- identical results, but log_softmax is more numerically stable
# ---- Application 3: stable cross-entropy loss ----
def stable_cross_entropy(logits, target_class):
"""Cross-entropy without risk of overflow or log(0)."""
return -(logits[target_class] - log_sum_exp(logits))
# This is exactly what torch.nn.CrossEntropyLoss does internally
import torch, torch.nn.functional as F
logits_t = torch.tensor([1000.0, 1001.0, 1002.0])
target = 2
print(f"Our stable CE: {stable_cross_entropy(logits_t.numpy(), target):.4f}")
print(f"PyTorch CE: {F.cross_entropy(logits_t.unsqueeze(0), torch.tensor([target])):.4f}")
# Both: 0.4076 -- correct and stableLSE in Log-Likelihood Computations
When computing the log-likelihood of a sequence or marginalising over latent variables, log-sum-exp appears naturally anywhere you need to compute log of a sum of exponentials:
import numpy as np
from scipy.special import logsumexp
# Marginalising over K latent states
# log P(x) = log sum_k P(x|z=k) P(z=k)
# = log sum_k exp(log P(x|z=k) + log P(z=k))
# = logsumexp(log_likelihoods + log_priors)
K = 5 # number of latent states (e.g., mixture components)
log_p_z = np.log(np.array([0.3, 0.2, 0.2, 0.15, 0.15])) # log prior
log_p_x_given_z = np.array([-2.1, -3.5, -1.8, -4.2, -2.9]) # log likelihood
# Marginal log P(x): sum over all latent states
log_p_x = logsumexp(log_p_x_given_z + log_p_z)
print(f"log P(x) = {log_p_x:.4f}")
# Posterior: P(z|x) via Bayes (in log space, then softmax)
log_unnorm_post = log_p_x_given_z + log_p_z
log_post = log_unnorm_post - log_p_x # normalise
posterior = np.exp(log_post)
print(f"Posterior P(z|x): {np.round(posterior, 3)}")
print(f"Sum: {posterior.sum():.4f}")
# Sum: 1.0000 -- correctly normalisedLayer normalisation computes the mean and variance of activations in a single forward pass. Naïvely, variance is computed as E[X²] - E[X]², which suffers catastrophic cancellation (Section 5.3). Welford’s algorithm computes mean and variance in one pass without this problem.
The Algorithm
Welford’s key insight: instead of accumulating sum and sum-of-squares, accumulate the mean and a running sum of squared deviations from the current mean. This keeps all values close to each other in magnitude, avoiding cancellation.
import numpy as np
class WelfordStats:
"""Online computation of mean and variance using Welford's algorithm."""
def __init__(self):
self.n, self.mean, self.M = 0, 0.0, 0.0
def update(self, x):
self.n += 1
delta = x - self.mean
self.mean += delta / self.n # update mean
delta2 = x - self.mean # note: uses NEW mean
self.M += delta * delta2 # running sum of sq. deviations
@property
def variance(self): return self.M / self.n # population
@property
def sample_variance(self): return self.M / (self.n - 1) # unbiased
# ---- Test on large-mean data (naive formula fails here) ----
data = [1e8 + xi for xi in [1.0, 2.0, 3.0, 4.0, 5.0]]
w = WelfordStats()
for xi in data: w.update(xi)
print(f"Welford mean: {w.mean:.6f}")
print(f"Welford variance: {w.variance:.6f}")
print(f"NumPy variance: {np.var(data):.6f}")
print(f"True variance: 2.000000")
# Welford: 2.000000 NumPy: 2.000000 (both correct; naive formula gives 0)
# ---- Layer Norm using Welford-style stable computation ----
def layer_norm_stable(x, eps=1e-5):
"""
Numerically stable layer normalisation.
x: (*, D) tensor-like. Normalises over last dimension.
"""
mu = x.mean(axis=-1, keepdims=True) # stable: subtract mean first
var = ((x - mu)**2).mean(axis=-1, keepdims=True)
return (x - mu) / np.sqrt(var + eps)
acts = np.random.randn(4, 768) * 1e3 + 1e6 # large-mean activations
normed = layer_norm_stable(acts)
print(f"Mean after LN: {normed.mean(axis=-1).mean():.6f}")
print(f"Std after LN: {normed.std(axis=-1).mean():.6f}")
# Mean: ≈0.000000 Std: ≈1.000000Even experienced engineers make mistakes when implementing custom backward passes. Gradient checking is the gold-standard test: compute gradients analytically (via backprop) and numerically (via finite differences), then compare. Any discrepancy reveals a bug.
Central Finite Differences
The central difference formula has O(h²) error (vs. O(h) for forward differences), making it far more accurate for a given step size h:
import numpy as np
def gradient_check(loss_fn, params, h=1e-5, rtol=1e-4, verbose=True):
"""
Check analytical gradients against central finite differences.
loss_fn: callable (params) -> (scalar_loss, list_of_analytic_grads)
params: list of numpy arrays (the parameter tensors)
Returns: True if all gradients pass, False otherwise.
"""
_, analytic_grads = loss_fn(params)
all_passed = True
for pi, (p, ag) in enumerate(zip(params, analytic_grads)):
num_grad = np.zeros_like(p, dtype=np.float64)
for idx in np.ndindex(p.shape):
p_plus = [q.copy() for q in params]
p_minus = [q.copy() for q in params]
p_plus[pi][idx] += h
p_minus[pi][idx] -= h
L_p, _ = loss_fn(p_plus)
L_m, _ = loss_fn(p_minus)
num_grad[idx] = (L_p - L_m) / (2 * h)
# Relative error: |analytic - numeric| / (|analytic| + |numeric| + eps)
err = np.abs(ag - num_grad)
rel_err = (err / (np.abs(ag) + np.abs(num_grad) + 1e-8)).max()
ok = rel_err < rtol
all_passed = all_passed and ok
if verbose:
mark = '✔ PASS' if ok else '✘ FAIL'
print(f" Param {pi} {p.shape}: max_rel_err={rel_err:.2e} {mark}")
return all_passed
# ---- Example: attention score computation ----
def attention_loss(params):
"""Scaled dot-product attention loss for gradient checking."""
Q, K, V = params
d = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d) # (T, T)
# Stable softmax
scores -= scores.max(axis=-1, keepdims=True)
attn = np.exp(scores); attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V # (T, D)
loss = float(out.mean()**2) # scalar
# Backward pass
dout = 2 * out.mean() * np.ones_like(out) / out.size
dV = attn.T @ dout
dattn = dout @ V.T
# Softmax backward: ds = diag(attn) - attn^T attn (vectorised)
ds = attn * (dattn - (dattn * attn).sum(axis=-1, keepdims=True))
dQ = ds @ K / np.sqrt(d)
dK = ds.T @ Q / np.sqrt(d)
return loss, [dQ, dK, dV]
np.random.seed(0)
T, D = 4, 8
Q0 = np.random.randn(T, D)
K0 = np.random.randn(T, D)
V0 = np.random.randn(T, D)
print("Gradient check for attention:")
gradient_check(attention_loss, [Q0, K0, V0])
# ✔ PASS for all three (Q, K, V)When Gradient Checks Fail
A failing gradient check always indicates a bug. The most common causes:
Training large models in full fp32 wastes memory and compute. Mixed-precision training uses lower-precision formats (fp16 or bf16) for the forward pass and gradient computations, while maintaining a full-precision fp32 copy of the weights for the optimizer update. This halves memory usage and roughly doubles throughput on modern GPUs.
The Mixed-Precision Workflow
import torch
import torch.nn as nn
# Setup
model = nn.Linear(768, 768).cuda()
optim = torch.optim.AdamW(model.parameters(), lr=3e-4)
# GradScaler: manages loss scaling automatically for fp16
# For bf16, scaler is not needed (no overflow risk)
scaler = torch.cuda.amp.GradScaler() # fp16 training
# Simulated training step
x = torch.randn(32, 768).cuda()
y = torch.randn(32, 768).cuda()
# --- fp16 training loop ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.float16):
pred = model(x) # forward in fp16
loss = nn.functional.mse_loss(pred, y)
scaler.scale(loss).backward() # scaled backward
scaler.unscale_(optim) # unscale before clip
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optim) # update if no inf/nan
scaler.update() # adjust scale factor
# --- bf16 training loop (simpler: no scaler needed) ---
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
pred = model(x)
loss = nn.functional.mse_loss(pred, y)
loss.backward() # no scaling needed for bf16
nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optim.step()Loss Scaling: Why it Exists and How It Works
In fp16, small gradient values (< 6×10⁻⁸) underflow to zero and are lost. Loss scaling multiplies the loss by a large constant S before backward, scaling all gradients up by S. After backward, divide by S before updating weights. This rescues small gradients from underflow without changing the optimisation.
import torch
def training_step_with_manual_scaling(model, x, y, optim, scale=65536.0):
"""Manual loss scaling to rescue fp16 underflowed gradients."""
optim.zero_grad()
with torch.cuda.amp.autocast(dtype=torch.float16):
loss = ((model(x) - y)**2).mean()
# Scale loss BEFORE backward (scales all gradients by S)
(loss * scale).backward()
# Check for overflow in scaled gradients
has_inf = any(p.grad.isinf().any() or p.grad.isnan().any()
for p in model.parameters() if p.grad is not None)
if has_inf:
print("Overflow detected — skipping update, halving scale")
optim.zero_grad()
return scale / 2 # reduce scale
# Unscale before weight update
for p in model.parameters():
if p.grad is not None:
p.grad /= scale
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optim.step()
return min(scale * 2, 65536.0) # increase scale if no overflowA training run is reproducible if running the same code with the same data produces the same result. Perfect reproducibility is often impossible in practice due to non-deterministic CUDA operations, but controllable seeds can get you 99% of the way there — which is essential for ablation studies and debugging.
Sources of Non-Determinism
import os, random, numpy as np, torch
def set_seed(seed: int, deterministic: bool = True):
"""
Set all random seeds for reproducible training.
deterministic=True: fully reproducible but ~10-20% slower.
"""
# 1. Python built-in random
random.seed(seed)
# 2. NumPy
np.random.seed(seed)
# 3. PyTorch CPU and CUDA
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed) # for multi-GPU
# 4. cuDNN determinism (trades speed for reproducibility)
torch.backends.cudnn.deterministic = deterministic
torch.backends.cudnn.benchmark = not deterministic
# 5. PyTorch deterministic algorithms (catches more ops)
if deterministic:
torch.use_deterministic_algorithms(True, warn_only=True)
# 6. Environment variable for any remaining CUDA sources
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
print(f"All seeds set to {seed}. Deterministic: {deterministic}")
# Usage at the top of every training script
set_seed(42, deterministic=True)
# Verify: two runs with same seed produce identical outputs
set_seed(0)
out1 = torch.randn(3, 3)
set_seed(0)
out2 = torch.randn(3, 3)
print(torch.allclose(out1, out2))
# True -- identicalWhat to Log for Reproducibility
Even with seeds set, environment differences can cause variation. Always log:
Numerical Stability Checklist
When you see NaN loss, inf gradients, or wildly wrong outputs, run through this checklist:
Quick Reference: Stable Implementations
Exercises
Exercises 1–6 are pen-and-paper; 7–12 require code.
Further reading: “What Every Computer Scientist Should Know About Floating-Point Arithmetic” (Goldberg, 1991) — the definitive reference, freely available online. “Numerical Optimization” (Nocedal & Wright) Chapter 8 for gradient checking theory. The PyTorch documentation on AMP (Automatic Mixed Precision) and GradScaler.
You have now covered the complete mathematical toolkit for understanding every design decision, loss function, and training recipe in modern LLMs. The five chapters of Part I connect as follows:
Part II begins with classical machine learning: discriminative models, generative approaches, word embeddings, and RNNs. Every concept builds directly on the foundations established here — and the Transformer in Part III will feel like a natural evolution rather than an unexplained black box.