Calculus & Optimization
Before we can train a neural network, we need to answer: “If I nudge this weight by a tiny amount, how much does the loss change?” That question is the derivative. Everything in this chapter flows from this single idea.
What is a Derivative?
Intuitively, the derivative of f at point x is the slope of the tangent line — how fast f is changing at that exact spot. Formally, it is the limit of the slope of a secant line as the two points collapse together:
This formal definition is the basis of the numerical gradient check — a technique every ML engineer should know cold.
# The formal limit definition of a derivative, implemented numerically
def numerical_derivative(f, x, h=1e-5):
"""Estimate f'(x) using the forward difference formula."""
return (f(x + h) - f(x)) / h
def central_derivative(f, x, h=1e-5):
"""Central difference: more accurate (O(h^2) vs O(h) error)."""
return (f(x + h) - f(x - h)) / (2 * h)
import math
# Test on known functions
f = lambda x: x**3 # f(x) = x^3, f'(x) = 3x^2
g = lambda x: math.sin(x) # f(x) = sin(x), f'(x) = cos(x)
x = 2.0
print(f"f'(2) analytic: {3 * x**2:.6f}") # 12.000000
print(f"f'(2) forward: {numerical_derivative(f,x):.6f}") # 12.000030
print(f"f'(2) central: {central_derivative(f,x):.6f}") # 12.000000
print(f"g'(2) analytic: {math.cos(x):.6f}") # -0.416147
print(f"g'(2) central: {central_derivative(g,x):.6f}") # -0.416147Key Derivatives Every ML Practitioner Must Know
These functions appear as activation functions, loss components, and normalization factors. Their derivatives must become reflexes.
import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-x))
def d_sigmoid(x): return sigmoid(x) * (1 - sigmoid(x)) # elegant self-referential form
def relu(x): return np.maximum(0, x)
def d_relu(x): return (x > 0).astype(float) # subgradient: 0 at x=0
def gelu(x): return x * sigmoid(1.702 * x) # fast approximation
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
# Numerical gradient check: compare analytic vs numerical derivative
h = 1e-5
numerical_d_sigmoid = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic_d_sigmoid = d_sigmoid(x)
print(np.allclose(numerical_d_sigmoid, analytic_d_sigmoid, atol=1e-8)) # True
# Practical note: sigmoid derivative is maximised at x=0 (= 0.25)
# This means gradients are at most 0.25 per sigmoid layer → vanishing!
print(d_sigmoid(np.array([0.0]))) # [0.25]Three rules cover 95% of the derivatives you will ever need to compute by hand in ML. The chain rule is by far the most important — it is the mathematical skeleton of backpropagation.
Product Rule
Mnemonic: “derivative of first times second, plus first times derivative of second.”
import numpy as np
# h(x) = x^2 * sin(x) → h'(x) = 2x*sin(x) + x^2*cos(x)
h = lambda x: x**2 * np.sin(x)
dh = lambda x: 2*x*np.sin(x) + x**2*np.cos(x)
x = np.array([0.5, 1.0, 1.5, 2.0])
numerical = (h(x + 1e-5) - h(x - 1e-5)) / (2e-5)
analytic = dh(x)
print(np.allclose(numerical, analytic, atol=1e-8)) # TrueThe Chain Rule — The Heart of Backpropagation
The chain rule tells us how to differentiate a composition of functions. If z depends on y and y depends on x, then:
Extended to deeper chains: if z = fₙ(...f₂(f₁(x))...), then:
import numpy as np
# Forward computation: L = mean((sigmoid(W @ x + b) - y)^2)
# We will trace the backward pass manually through each step.
# --- Forward pass ---
W = np.array([[0.3, -0.2, 0.5]]) # (1, 3)
x = np.array([1.0, 2.0, 3.0]) # (3,) — input
b = np.array([0.1]) # bias
y = np.array([1.0]) # target
z = W @ x + b # (1,) linear pre-activation
a = 1 / (1 + np.exp(-z)) # (1,) sigmoid activation
diff = a - y # (1,) residual
L = np.mean(diff**2) # scalar loss
print(f"Loss = {L:.4f}")
# --- Backward pass: chain rule applied step by step ---
dL_diff = 2 * diff / len(diff) # dL/d(diff) = 2*(a-y)/N
dL_da = dL_diff * 1 # d(diff)/da = 1
dL_dz = dL_da * (a * (1 - a)) # da/dz = sigmoid derivative
dL_dW = dL_dz[:, None] * x[None, :] # dz/dW = x (outer product)
dL_db = dL_dz # dz/db = 1
dL_dx = W.T @ dL_dz # dz/dx = W
print(f"dL/dW = {dL_dW}")
print(f"dL/db = {dL_db}")Computational Graphs — Visualizing the Chain
A computational graph makes the chain rule mechanical. Every node stores its output. Every edge stores a local derivative. The backward pass multiplies these local derivatives along each path from the loss to the parameters.
# Graph for: L = (sigmoid(w*x + b) - y)^2
# Nodes: w, x, b → n1=w*x → n2=n1+b → n3=sigmoid(n2) → n4=n3-y → L=n4^2
w, x, b, y = 0.5, 2.0, 0.1, 1.0
# Forward: compute and CACHE every intermediate
n1 = w * x # 1.0
n2 = n1 + b # 1.1
n3 = 1 / (1 + 2.718**(-n2)) # ≈0.750 (sigmoid)
n4 = n3 - y # -0.250
L = n4**2 # 0.0625
# Backward: chain rule, one node at a time
dL_n4 = 2 * n4 # dL/dn4 = 2*(n3-y)
dL_n3 = dL_n4 * 1 # dn4/dn3 = 1
dL_n2 = dL_n3 * n3 * (1 - n3) # dn3/dn2 = sigmoid'
dL_n1 = dL_n2 * 1 # dn2/dn1 = 1
dL_dw = dL_n1 * x # dn1/dw = x
dL_db = dL_n2 * 1 # dn2/db = 1
print(f"dL/dw={dL_dw:.4f} dL/db={dL_db:.4f}")A neural network has millions of parameters. We need to know how the loss changes with respect to each one. That’s the gradient — a vector of partial derivatives.
Partial Derivatives
A partial derivative ∂f/∂xᵢ treats all variables except xᵢ as constants. The notation ∂ (curly d) signals “partial.”
import numpy as np
# f(x, y) = 3x^2 + 2xy + y^3
# ∂f/∂x = 6x + 2y ∂f/∂y = 2x + 3y^2
def f(x, y): return 3*x**2 + 2*x*y + y**3
x, y = 1.0, 2.0
h = 1e-5
# Partial w.r.t. x: vary x, hold y fixed
df_dx_num = (f(x+h, y) - f(x-h, y)) / (2*h)
df_dx_ana = 6*x + 2*y # analytic
print(f"∂f/∂x: numeric={df_dx_num:.4f} analytic={df_dx_ana:.4f}") # 10.0
# Partial w.r.t. y: vary y, hold x fixed
df_dy_num = (f(x, y+h) - f(x, y-h)) / (2*h)
df_dy_ana = 2*x + 3*y**2 # analytic
print(f"∂f/∂y: numeric={df_dy_num:.4f} analytic={df_dy_ana:.4f}") # 14.0The Gradient — All Partials in a Vector
The gradient ∇f collects all partial derivatives into a single vector that points in the direction of steepest ascent of f in parameter space.
Three geometric facts about the gradient that are essential for understanding training:
import numpy as np
# f(x, y) = x^2 + y^2 (a bowl; minimum at origin)
def f(params): return params[0]**2 + params[1]**2
def gradient(f, params, h=1e-5):
"""Compute gradient via central differences for all parameters."""
grad = np.zeros_like(params)
for i in range(len(params)):
p_plus = params.copy(); p_plus[i] += h
p_minus = params.copy(); p_minus[i] -= h
grad[i] = (f(p_plus) - f(p_minus)) / (2*h)
return grad
params = np.array([3.0, -4.0])
g = gradient(f, params)
print(g) # [6. -8.] (pointing away from minimum)
# Key property 1: gradient points UPHILL (steepest ascent)
# Key property 2: -gradient points DOWNHILL (gradient descent direction)
# Key property 3: gradient is ZERO at a minimum
print(gradient(f, np.array([0.0, 0.0]))) # [0. 0.]The Jacobian — Gradients for Vector-Valued Functions
When f maps a vector to a vector (like a linear layer), the derivative is a matrix called the Jacobian. Each row is the gradient of one output with respect to all inputs.
import numpy as np
# f(x) = Wx → Jacobian = W (because ∂(Wx)_i/∂x_j = W_ij)
W = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]]) # (2, 3) linear map
def numerical_jacobian(f, x, h=1e-5):
f0 = f(x)
J = np.zeros((len(f0), len(x)))
for j in range(len(x)):
xp = x.copy(); xp[j] += h
xm = x.copy(); xm[j] -= h
J[:, j] = (f(xp) - f(xm)) / (2 * h)
return J
x = np.array([1.0, 2.0, 3.0])
f = lambda x: W @ x
J_num = numerical_jacobian(f, x)
print(np.allclose(J_num, W)) # True — Jacobian of linear layer = WGradient descent is the algorithm that actually trains neural networks. Its elegance is breathtaking: to minimize a loss, repeatedly move the parameters in the direction that most rapidly decreases it.
The Update Rule
Where: θ are the parameters, η (eta) is the learning rate (step size), and ∇_θ L is the gradient of the loss w.r.t. parameters. The negative sign ensures we move downhill (opposite the gradient).
import numpy as np
# Loss landscape: L(w) = (w - 3)^2 (minimum at w=3)
def loss(w): return (w - 3)**2
def dloss(w): return 2 * (w - 3) # gradient
w = 0.0 # starting point (far from minimum)
lr = 0.1 # learning rate
print(f"{'Step':>6} {'w':>8} {'L(w)':>10} {'grad':>10}")
for step in range(1, 21):
grad = dloss(w)
w -= lr * grad # UPDATE: move downhill
if step % 5 == 0:
print(f"{step:>6} {w:>8.4f} {loss(w):>10.6f} {grad:>10.4f}")
# Step w L(w) grad
# 5 1.9221 1.168928 -2.0972
# 10 2.6510 0.120897 -0.6978
# 15 2.8825 0.013928 -0.2350
# 20 2.9575 0.001798 -0.0849Learning Rate: The Most Important Hyperparameter
The learning rate η controls how large a step to take. Too large and we overshoot and diverge; too small and training takes forever. Finding the right value is one of the central skills of LLM training.
import numpy as np
def run_gd(lr, steps=30):
w = 0.0
history = []
for _ in range(steps):
w -= lr * 2 * (w - 3) # gradient of (w-3)^2
history.append((w - 3)**2)
return history
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
for lr, label in [(0.01, 'too small'), (0.1, 'good'), (0.9, 'too large'), (1.1, 'diverges')]:
losses = run_gd(lr)
ax.semilogy(losses, label=f"lr={lr} ({label})")
ax.set_xlabel('Step'); ax.set_ylabel('Loss (log scale)'); ax.legend(); plt.tight_layout()
plt.savefig('lr_sensitivity.png', dpi=150) # plot the four trajectoriesStochastic & Mini-Batch Gradient Descent
In practice, computing the gradient over the entire dataset at every step is prohibitively expensive. We use subsets (“mini-batches”) instead, trading gradient accuracy for speed.
import numpy as np
np.random.seed(42)
# Generate noisy linear data: y = 2x + 1 + noise
N = 1000
X = np.random.randn(N, 1) # (1000, 1)
y = 2 * X.squeeze() + 1 + 0.3 * np.random.randn(N)
# Learnable parameters
w = np.random.randn() # weight (init random)
b = 0.0 # bias (init 0)
lr, batch_size = 0.01, 32
for epoch in range(5):
idx = np.random.permutation(N) # shuffle data
X_s, y_s = X[idx], y[idx]
for i in range(0, N, batch_size):
Xb = X_s[i:i+batch_size].squeeze()
yb = y_s[i:i+batch_size]
pred = w * Xb + b # forward pass
loss = np.mean((pred - yb)**2) # MSE loss
dw = 2 * np.mean((pred-yb) * Xb) # gradient w.r.t. w
db = 2 * np.mean(pred - yb) # gradient w.r.t. b
w -= lr * dw # update
b -= lr * db
print(f"Epoch {epoch+1}: w={w:.3f}, b={b:.3f}, loss={loss:.4f}")
# w → 2.0, b → 1.0 (recovers true parameters)Backpropagation is the most important algorithm in modern ML. Every framework — PyTorch, JAX, TensorFlow — is essentially a very fast, very careful implementation of this idea. In this section we build a working scalar autograd engine from scratch.
The Algorithm in Three Steps
Code Lab 2.1 — Build a Scalar Autograd Engine
The following implementation is inspired by Andrej Karpathy’s micrograd. Each Value object wraps a scalar and tracks its gradient. Operations like + and * register a backward function that propagates gradients.
class Value:
"""A scalar value with automatic gradient tracking."""
def __init__(self, data, _children=(), _op=''):
self.data = float(data)
self.grad = 0.0 # gradient, initialised to 0
self._prev = set(_children)
self._op = _op # operation that produced this node
self._backward = lambda: None # default: leaf node
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += 1.0 * out.grad # d(a+b)/da = 1
other.grad += 1.0 * out.grad # d(a+b)/db = 1
out._backward = _backward
return out
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad # d(a*b)/da = b
other.grad += self.data * out.grad # d(a*b)/db = a
out._backward = _backward
return out
def sigmoid(self):
import math
s = 1 / (1 + math.exp(-self.data))
out = Value(s, (self,), 'sigmoid')
def _backward():
self.grad += s * (1 - s) * out.grad # sigmoid derivative
out._backward = _backward
return out
def backward(self):
"""Topological sort + backward pass."""
topo, visited = [], set()
def build_topo(v):
if v not in visited:
visited.add(v)
for child in v._prev: build_topo(child)
topo.append(v)
build_topo(self)
self.grad = 1.0 # dL/dL = 1 (seed)
for node in reversed(topo): node._backward()
def __repr__(self):
return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
__radd__ = __add__; __rmul__ = __mul__# Train a single neuron: y_hat = sigmoid(w1*x1 + w2*x2 + b)
# Target: learn AND gate inputs=[0,0,0,1,1,0,1,1] targets=[0,0,0,1]
import random
random.seed(42)
data = [([0,0], 0), ([0,1], 0), ([1,0], 0), ([1,1], 1)]
w1, w2, b = Value(random.gauss(0,1)), Value(random.gauss(0,1)), Value(0.0)
for epoch in range(200):
total_loss = Value(0.0)
for (x1, x2), target in data:
pred = (w1 * x1 + w2 * x2 + b).sigmoid()
loss = (pred + (-target)).sigmoid() # rough BCE proxy
total_loss = total_loss + loss
# Zero gradients, backward, update
for p in [w1, w2, b]: p.grad = 0.0
total_loss.backward()
for p in [w1, w2, b]: p.data -= 0.1 * p.grad
if epoch % 50 == 0:
print(f"Epoch {epoch}: loss={total_loss.data:.4f} w1={w1.data:.2f} w2={w2.data:.2f}")In the early days of deep learning, training networks with more than ~4 layers was nearly impossible. The culprit: gradients either shrank to zero (vanishing) or grew to infinity (exploding) during backpropagation. Understanding this is essential for understanding every architectural decision in the Transformer.
The Root Cause: Repeated Multiplication
In a network of depth L, the gradient of the loss w.r.t. the first layer involves a product of L Jacobian matrices. If each has entries slightly less than 1, the product collapses to zero. If entries are slightly above 1, it explodes.
import numpy as np
# Simulate gradient flowing back through L sigmoid layers
# sigmoid'(x) ≤ 0.25, so each layer multiplies gradient by ≤ 0.25
def simulate_gradient_flow(L, activation_deriv=0.25):
"""L layers, each multiplying gradient by activation_deriv."""
grad = 1.0 # gradient at output layer
for _ in range(L): grad *= activation_deriv
return grad
print("Gradient magnitude at input layer:")
for L in [1, 2, 5, 10, 20, 50]:
g_sig = simulate_gradient_flow(L, 0.25) # sigmoid
g_relu = simulate_gradient_flow(L, 1.0) # ReLU (best case)
print(f"L={L:2d}: sigmoid={g_sig:.2e} relu={g_relu:.1f}")
# L= 1: sigmoid=2.50e-01 relu=1.0
# L= 5: sigmoid=9.77e-04 relu=1.0
# L=10: sigmoid=9.54e-07 relu=1.0
# L=20: sigmoid=9.09e-13 relu=1.0
# L=50: sigmoid=≈10^-30 relu=1.0 ← completely uselessHow Transformers Avoid This Problem
The Transformer architecture uses three structural choices that keep gradients healthy throughout training in very deep networks:
Raw gradient descent works, but it is slow on ill-conditioned loss landscapes (where some directions are steep, others flat). Modern optimizers add memory of past gradients and adaptive per-parameter step sizes to accelerate convergence dramatically.
Momentum — Accumulating History
Momentum keeps a running average of past gradients. It accelerates through flat regions and dampens oscillations in narrow valleys.
import numpy as np
class SGDMomentum:
def __init__(self, params, lr=0.01, momentum=0.9):
self.params = params
self.lr = lr
self.momentum = momentum
self.velocity = [np.zeros_like(p) for p in params]
def step(self, grads):
for i, (p, g) in enumerate(zip(self.params, grads)):
self.velocity[i] = self.momentum * self.velocity[i] + (1 - self.momentum) * g
p -= self.lr * self.velocity[i]Adam — Adaptive Moments
Adam (Adaptive Moment Estimation) combines momentum (first moment) with an adaptive learning rate per parameter based on the magnitude of recent gradients (second moment). It is the most widely used optimizer in deep learning.
import numpy as np
class Adam:
def __init__(self, params, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
self.params = params
self.lr = lr
self.b1, self.b2, self.eps = b1, b2, eps
self.m = [np.zeros_like(p) for p in params] # 1st moment
self.v = [np.zeros_like(p) for p in params] # 2nd moment
self.t = 0 # step counter
def step(self, grads):
self.t += 1
for i, (p, g) in enumerate(zip(self.params, grads)):
self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g**2
m_hat = self.m[i] / (1 - self.b1**self.t) # bias correction
v_hat = self.v[i] / (1 - self.b2**self.t)
p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)AdamW — Decoupled Weight Decay
Standard Adam applies L2 regularization through the gradient, which interacts badly with adaptive learning rates. AdamW decouples weight decay from the gradient update, making it more effective. This distinction matters:
import torch
import torch.nn as nn
# GPT-style training setup
model = nn.Transformer(d_model=512, nhead=8)
# Standard AdamW hyperparameters for LLM training
optimizer = torch.optim.AdamW(
model.parameters(),
lr = 3e-4, # starting LR (will be scheduled)
betas = (0.9, 0.95), # GPT-3 values (b2=0.95 not 0.999)
eps = 1e-8,
weight_decay= 0.1, # AdamW decoupled weight decay
)
# Training step with gradient clipping
loss = model(...).mean() # forward pass
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) # CRITICAL
optimizer.step()Learning Rate Scheduling
A fixed learning rate is rarely optimal. Modern LLM training uses a schedule with three phases: linear warmup, cosine decay, and a floor.
import math
def cosine_lr_with_warmup(step, warmup_steps, total_steps, max_lr, min_lr):
"""LR schedule used in GPT-3, LLaMA, and most large LLMs."""
if step < warmup_steps: # linear warmup phase
return max_lr * step / warmup_steps
if step > total_steps: # after training (just in case)
return min_lr
# Cosine decay phase
progress = (step - warmup_steps) / (total_steps - warmup_steps)
coeff = 0.5 * (1 + math.cos(math.pi * progress)) # 1.0 → 0.0
return min_lr + coeff * (max_lr - min_lr)
# Typical values for a medium-sized LLM
schedule = [cosine_lr_with_warmup(s, warmup_steps=2000, total_steps=100000,
max_lr=3e-4, min_lr=3e-5)
for s in range(100001)]
print(f"Step 0: {schedule[0]:.2e}") # 0.00e+00 (starts at 0)
print(f"Step 2000: {schedule[2000]:.2e}") # 3.00e-04 (peak)
print(f"Step 100k: {schedule[-1]:.2e}") # 3.00e-05 (floor)Every time you implement a custom backward pass, a new loss function, or a novel architecture component, you should verify the gradients numerically. Gradient checking compares analytical gradients (from backprop) against numerical estimates (from finite differences). They should agree to within ~1e-5.
import numpy as np
def gradient_check(loss_fn, params, h=1e-5, tol=1e-5, verbose=True):
"""
Compare analytical gradients to numerical (central differences).
loss_fn: callable that takes params list → (loss, list of analytic grads)
params: list of numpy arrays
"""
_, analytic_grads = loss_fn(params)
all_ok = True
for pi, (p, ag) in enumerate(zip(params, analytic_grads)):
num_grad = np.zeros_like(p)
for idx in np.ndindex(p.shape):
p_plus = [q.copy() for q in params]
p_minus = [q.copy() for q in params]
p_plus[pi][idx] += h
p_minus[pi][idx] -= h
L_plus, _ = loss_fn(p_plus)
L_minus, _ = loss_fn(p_minus)
num_grad[idx] = (L_plus - L_minus) / (2 * h)
err = np.max(np.abs(ag - num_grad) / (np.abs(ag) + np.abs(num_grad) + 1e-8))
ok = err < tol
if verbose:
status = '✓ OK' if ok else '✗ FAIL'
print(f"Param {pi}: max_rel_err={err:.2e} {status}")
all_ok = all_ok and ok
return all_ok
# Example: verify gradients of a custom loss
def my_loss(params):
W, b = params
x = np.array([1.0, 2.0, 3.0])
pred = np.tanh(W @ x + b)
loss = float(np.mean(pred**2))
# Analytic gradients
dtanh = 1 - pred**2 # tanh derivative
dL_dp = 2 * pred / len(pred)
dL_dz = dL_dp * dtanh
dW = np.outer(dL_dz, x)
db = dL_dz
return loss, [dW, db]
W0 = np.random.randn(2, 3); b0 = np.random.randn(2)
gradient_check(my_loss, [W0, b0]) # should print ✓ OK for bothKey Formulas Reference
Exercises
Exercises 1–10 are pen-and-paper; 11–16 require code. Complete all before Chapter 3.
Further reading: Andrej Karpathy’s “micrograd” repository (github.com/karpathy/micrograd) — the inspiration for Code Lab 2.1. CS231n “Backpropagation Notes” (cs231n.stanford.edu). “Numerical Optimization” by Nocedal & Wright for a rigorous treatment of Adam and second-order methods.
Next: Chapter 3 — Probability & Statistics. We reframe training as maximum likelihood estimation and see why cross-entropy is the correct loss for language modeling.