Part I: Mathematical Foundations

Chapter 2

Calculus & Optimization

Derivatives, gradients, chain rule, and backpropagation

16 Exercises

Learning Objectives

1.	Compute derivatives of any elementary function using differentiation rules.
2.	Apply the chain rule by hand through multi-step computational graphs.
3.	Understand gradient descent geometrically and implement it from scratch.
4.	Derive the full backpropagation algorithm from the chain rule alone.
5.	Implement a working scalar autograd engine capable of training a small neural network.
6.	Understand Adam, AdamW, and gradient clipping — the standard LLM training recipe.

Before we can train a neural network, we need to answer: “If I nudge this weight by a tiny amount, how much does the loss change?” That question is the derivative. Everything in this chapter flows from this single idea.

What is a Derivative?

Intuitively, the derivative of f at point x is the slope of the tangent line — how fast f is changing at that exact spot. Formally, it is the limit of the slope of a secant line as the two points collapse together:

f'(x) = lim_{h → 0} [f(x + h) - f(x)] / h

This formal definition is the basis of the numerical gradient check — a technique every ML engineer should know cold.

Python•Numerical derivative — the formal definition in code
# The formal limit definition of a derivative, implemented numerically

def numerical_derivative(f, x, h=1e-5):
    """Estimate f'(x) using the forward difference formula."""
    return (f(x + h) - f(x)) / h

def central_derivative(f, x, h=1e-5):
    """Central difference: more accurate (O(h^2) vs O(h) error)."""
    return (f(x + h) - f(x - h)) / (2 * h)

import math

# Test on known functions
f    = lambda x: x**3   # f(x) = x^3,  f'(x) = 3x^2
g    = lambda x: math.sin(x)  # f(x) = sin(x), f'(x) = cos(x)
x    = 2.0

print(f"f'(2) analytic: {3 * x**2:.6f}")          # 12.000000
print(f"f'(2) forward:  {numerical_derivative(f,x):.6f}") # 12.000030
print(f"f'(2) central:  {central_derivative(f,x):.6f}")  # 12.000000

print(f"g'(2) analytic: {math.cos(x):.6f}")        # -0.416147
print(f"g'(2) central:  {central_derivative(g,x):.6f}")  # -0.416147

Key Derivatives Every ML Practitioner Must Know

These functions appear as activation functions, loss components, and normalization factors. Their derivatives must become reflexes.

Function f(x)	Derivative f'(x)	Why it matters
x^n	n x^(n-1)	Polynomial layers
e^x	e^x	Exponentials, softmax numerator
ln(x)	1/x	Log-likelihood, cross-entropy
sigmoid: 1/(1+e^-x)	σ(x)(1 - σ(x))	Binary classification, LSTM gates
tanh(x)	1 - tanh²(x)	Older RNN activations
ReLU: max(0,x)	0 if x<0, 1 if x>0	Default activation, residual nets
GELU(x)	≈ σ(1.702x) f(x)	GPT, BERT activations
sin(x)	cos(x)	Positional encoding components

Python•Key activation derivatives — verify numerically
import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))
def d_sigmoid(x): return sigmoid(x) * (1 - sigmoid(x))  # elegant self-referential form
def relu(x):    return np.maximum(0, x)
def d_relu(x):   return (x > 0).astype(float)          # subgradient: 0 at x=0
def gelu(x):    return x * sigmoid(1.702 * x)             # fast approximation

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Numerical gradient check: compare analytic vs numerical derivative
h = 1e-5
numerical_d_sigmoid = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic_d_sigmoid   = d_sigmoid(x)
print(np.allclose(numerical_d_sigmoid, analytic_d_sigmoid, atol=1e-8))  # True

# Practical note: sigmoid derivative is maximised at x=0 (= 0.25)
# This means gradients are at most 0.25 per sigmoid layer → vanishing!
print(d_sigmoid(np.array([0.0])))  # [0.25]

▶

ML Connection: Why ReLU Replaced Sigmoid

Sigmoid's derivative is at most 0.25. In a 10-layer network, gradients shrink by at least (0.25)^10 ≈ 0.000001 — vanishing to nothing before reaching early layers.

ReLU's derivative is either 0 or 1. Gradients don't shrink through ReLU layers (they can die permanently if x < 0, but that's a different problem solved by Leaky ReLU/GELU). This is why deep networks only became trainable once ReLU was popularized around 2012.

Three rules cover 95% of the derivatives you will ever need to compute by hand in ML. The chain rule is by far the most important — it is the mathematical skeleton of backpropagation.

Product Rule

d/dx [f(x) g(x)] = f'(x) g(x) + f(x) g'(x)

Mnemonic: “derivative of first times second, plus first times derivative of second.”

Python•Product rule — applied to x^2 × sin(x)
import numpy as np

# h(x) = x^2 * sin(x)   →  h'(x) = 2x*sin(x) + x^2*cos(x)
h  = lambda x: x**2 * np.sin(x)
dh = lambda x: 2*x*np.sin(x) + x**2*np.cos(x)

x = np.array([0.5, 1.0, 1.5, 2.0])
numerical = (h(x + 1e-5) - h(x - 1e-5)) / (2e-5)
analytic  = dh(x)
print(np.allclose(numerical, analytic, atol=1e-8))  # True

The Chain Rule — The Heart of Backpropagation

The chain rule tells us how to differentiate a composition of functions. If z depends on y and y depends on x, then:

dz/dx = (dz/dy) × (dy/dx)

Extended to deeper chains: if z = fₙ(...f₂(f₁(x))...), then:

dz/dx = (dz/dfₙ) × (dfₙ/df_{n-1}) × ... × (df₂/df₁) × (df₁/dx)

✧

The Chain Rule IS Backpropagation

Backpropagation is not a separate algorithm someone invented. It is simply the chain rule applied systematically to a computational graph, starting from the loss and working backwards.

The “forward pass” caches all intermediate values. The “backward pass” walks the graph in reverse, multiplying local gradients together. That's it.

Python•Chain rule by hand — step by step
import numpy as np

# Forward computation: L = mean((sigmoid(W @ x + b) - y)^2)
# We will trace the backward pass manually through each step.

# --- Forward pass --- 
W = np.array([[0.3, -0.2, 0.5]])  # (1, 3)
x = np.array([1.0, 2.0, 3.0])    # (3,)  — input
b = np.array([0.1])              # bias
y = np.array([1.0])              # target

z    = W @ x + b                   # (1,)  linear pre-activation
a    = 1 / (1 + np.exp(-z))        # (1,)  sigmoid activation
diff = a - y                       # (1,)  residual
L    = np.mean(diff**2)            # scalar loss
print(f"Loss = {L:.4f}")

# --- Backward pass: chain rule applied step by step ---
dL_diff = 2 * diff / len(diff)    # dL/d(diff) = 2*(a-y)/N
dL_da   = dL_diff * 1             # d(diff)/da = 1
dL_dz   = dL_da * (a * (1 - a))  # da/dz = sigmoid derivative
dL_dW   = dL_dz[:, None] * x[None, :] # dz/dW = x  (outer product)
dL_db   = dL_dz                   # dz/db = 1
dL_dx   = W.T @ dL_dz             # dz/dx = W

print(f"dL/dW = {dL_dW}")
print(f"dL/db = {dL_db}")

Computational Graphs — Visualizing the Chain

A computational graph makes the chain rule mechanical. Every node stores its output. Every edge stores a local derivative. The backward pass multiplies these local derivatives along each path from the loss to the parameters.

Python•Computational graph traced manually
# Graph for: L = (sigmoid(w*x + b) - y)^2
# Nodes: w, x, b → n1=w*x → n2=n1+b → n3=sigmoid(n2) → n4=n3-y → L=n4^2

w, x, b, y = 0.5, 2.0, 0.1, 1.0

# Forward: compute and CACHE every intermediate
n1 = w * x                      # 1.0
n2 = n1 + b                     # 1.1
n3 = 1 / (1 + 2.718**(-n2))    # ≈0.750 (sigmoid)
n4 = n3 - y                     # -0.250
L  = n4**2                      # 0.0625

# Backward: chain rule, one node at a time
dL_n4 = 2 * n4                  # dL/dn4 = 2*(n3-y)
dL_n3 = dL_n4 * 1               # dn4/dn3 = 1
dL_n2 = dL_n3 * n3 * (1 - n3)   # dn3/dn2 = sigmoid'
dL_n1 = dL_n2 * 1               # dn2/dn1 = 1
dL_dw = dL_n1 * x               # dn1/dw = x
dL_db = dL_n2 * 1               # dn2/db = 1
print(f"dL/dw={dL_dw:.4f}  dL/db={dL_db:.4f}")

A neural network has millions of parameters. We need to know how the loss changes with respect to each one. That’s the gradient — a vector of partial derivatives.

Partial Derivatives

A partial derivative ∂f/∂xᵢ treats all variables except xᵢ as constants. The notation ∂ (curly d) signals “partial.”

∂f/∂xᵢ = lim_{h → 0} [f(x₁,..., xᵢ + h,..., xₙ) - f(x₁,..., xᵢ,..., xₙ)] / h

Python•Partial derivatives — computed per parameter
import numpy as np

# f(x, y) = 3x^2 + 2xy + y^3
# ∂f/∂x = 6x + 2y     ∂f/∂y = 2x + 3y^2

def f(x, y): return 3*x**2 + 2*x*y + y**3

x, y = 1.0, 2.0
h    = 1e-5

# Partial w.r.t. x: vary x, hold y fixed
df_dx_num = (f(x+h, y) - f(x-h, y)) / (2*h)
df_dx_ana = 6*x + 2*y              # analytic
print(f"∂f/∂x: numeric={df_dx_num:.4f} analytic={df_dx_ana:.4f}")  # 10.0

# Partial w.r.t. y: vary y, hold x fixed
df_dy_num = (f(x, y+h) - f(x, y-h)) / (2*h)
df_dy_ana = 2*x + 3*y**2           # analytic
print(f"∂f/∂y: numeric={df_dy_num:.4f} analytic={df_dy_ana:.4f}")  # 14.0

The Gradient — All Partials in a Vector

The gradient ∇f collects all partial derivatives into a single vector that points in the direction of steepest ascent of f in parameter space.

∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ

Three geometric facts about the gradient that are essential for understanding training:

Python•Gradient: direction of steepest ascent
import numpy as np

# f(x, y) = x^2 + y^2   (a bowl; minimum at origin)
def f(params): return params[0]**2 + params[1]**2

def gradient(f, params, h=1e-5):
    """Compute gradient via central differences for all parameters."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        p_plus  = params.copy(); p_plus[i]  += h
        p_minus = params.copy(); p_minus[i] -= h
        grad[i] = (f(p_plus) - f(p_minus)) / (2*h)
    return grad

params = np.array([3.0, -4.0])
g      = gradient(f, params)
print(g)  # [6.  -8.]  (pointing away from minimum)

# Key property 1: gradient points UPHILL (steepest ascent)
# Key property 2: -gradient points DOWNHILL (gradient descent direction)
# Key property 3: gradient is ZERO at a minimum
print(gradient(f, np.array([0.0, 0.0])))  # [0. 0.]

The Jacobian — Gradients for Vector-Valued Functions

When f maps a vector to a vector (like a linear layer), the derivative is a matrix called the Jacobian. Each row is the gradient of one output with respect to all inputs.

J_{ij} = ∂f_i / ∂x_j shape: (dim_output, dim_input)

Python•Jacobian of a linear layer
import numpy as np

# f(x) = Wx   →  Jacobian = W  (because ∂(Wx)_i/∂x_j = W_ij)
W = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])  # (2, 3) linear map

def numerical_jacobian(f, x, h=1e-5):
    f0   = f(x)
    J    = np.zeros((len(f0), len(x)))
    for j in range(len(x)):
        xp      = x.copy(); xp[j] += h
        xm      = x.copy(); xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2 * h)
    return J

x        = np.array([1.0, 2.0, 3.0])
f        = lambda x: W @ x
J_num    = numerical_jacobian(f, x)
print(np.allclose(J_num, W))  # True — Jacobian of linear layer = W

Gradient descent is the algorithm that actually trains neural networks. Its elegance is breathtaking: to minimize a loss, repeatedly move the parameters in the direction that most rapidly decreases it.

The Update Rule

θ ← θ - η ∇_θ L(θ)

Where: θ are the parameters, η (eta) is the learning rate (step size), and ∇_θ L is the gradient of the loss w.r.t. parameters. The negative sign ensures we move downhill (opposite the gradient).

Python•Gradient descent from scratch on a quadratic loss
import numpy as np

# Loss landscape: L(w) = (w - 3)^2   (minimum at w=3)
def loss(w): return (w - 3)**2
def dloss(w): return 2 * (w - 3)  # gradient

w  = 0.0          # starting point (far from minimum)
lr = 0.1          # learning rate

print(f"{'Step':>6} {'w':>8} {'L(w)':>10} {'grad':>10}")
for step in range(1, 21):
    grad = dloss(w)
    w    -= lr * grad    # UPDATE: move downhill
    if step % 5 == 0:
        print(f"{step:>6} {w:>8.4f} {loss(w):>10.6f} {grad:>10.4f}")
# Step      w      L(w)       grad
#     5  1.9221   1.168928  -2.0972
#    10  2.6510   0.120897  -0.6978
#    15  2.8825   0.013928  -0.2350
#    20  2.9575   0.001798  -0.0849

Learning Rate: The Most Important Hyperparameter

The learning rate η controls how large a step to take. Too large and we overshoot and diverge; too small and training takes forever. Finding the right value is one of the central skills of LLM training.

Python•Learning rate sensitivity
import numpy as np

def run_gd(lr, steps=30):
    w = 0.0
    history = []
    for _ in range(steps):
        w -= lr * 2 * (w - 3)  # gradient of (w-3)^2
        history.append((w - 3)**2)
    return history

import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
for lr, label in [(0.01, 'too small'), (0.1, 'good'), (0.9, 'too large'), (1.1, 'diverges')]:
    losses = run_gd(lr)
    ax.semilogy(losses, label=f"lr={lr} ({label})")
ax.set_xlabel('Step'); ax.set_ylabel('Loss (log scale)'); ax.legend(); plt.tight_layout()
plt.savefig('lr_sensitivity.png', dpi=150)  # plot the four trajectories

Stochastic & Mini-Batch Gradient Descent

In practice, computing the gradient over the entire dataset at every step is prohibitively expensive. We use subsets (“mini-batches”) instead, trading gradient accuracy for speed.

Variant	Batch size	Gradient	Tradeoff
Batch GD	Full dataset	Exact	Slow per step; no noise
SGD	1 sample	Very noisy	Fast; high variance
Mini-batch GD	16 – 4096	Approximate	GPU-efficient; best in practice

Python•Mini-batch SGD on a toy regression problem
import numpy as np

np.random.seed(42)
# Generate noisy linear data: y = 2x + 1 + noise
N      = 1000
X      = np.random.randn(N, 1)              # (1000, 1)
y      = 2 * X.squeeze() + 1 + 0.3 * np.random.randn(N)

# Learnable parameters
w = np.random.randn()  # weight (init random)
b = 0.0              # bias (init 0)
lr, batch_size = 0.01, 32

for epoch in range(5):
    idx  = np.random.permutation(N)    # shuffle data
    X_s, y_s = X[idx], y[idx]
    for i in range(0, N, batch_size):
        Xb = X_s[i:i+batch_size].squeeze()
        yb = y_s[i:i+batch_size]
        pred = w * Xb + b             # forward pass
        loss = np.mean((pred - yb)**2)  # MSE loss
        dw   = 2 * np.mean((pred-yb) * Xb) # gradient w.r.t. w
        db   = 2 * np.mean(pred - yb)    # gradient w.r.t. b
        w   -= lr * dw                 # update
        b   -= lr * db
    print(f"Epoch {epoch+1}: w={w:.3f}, b={b:.3f}, loss={loss:.4f}")
# w → 2.0, b → 1.0  (recovers true parameters)

Backpropagation is the most important algorithm in modern ML. Every framework — PyTorch, JAX, TensorFlow — is essentially a very fast, very careful implementation of this idea. In this section we build a working scalar autograd engine from scratch.

The Algorithm in Three Steps

1	Forward pass	Execute the computation, caching all intermediate values (activations, pre-activations, etc.) that will be needed during the backward pass.
2	Compute loss	Evaluate the loss function on the outputs. This is the scalar we differentiate with respect to all parameters.
3	Backward pass	Walk the computational graph in reverse topological order. At each node, apply the chain rule: accumulate the incoming gradient times the local derivative.

Code Lab 2.1 — Build a Scalar Autograd Engine

The following implementation is inspired by Andrej Karpathy’s micrograd. Each Value object wraps a scalar and tracks its gradient. Operations like + and * register a backward function that propagates gradients.

Python•Scalar autograd engine — full implementation
class Value:
    """A scalar value with automatic gradient tracking."""
    def __init__(self, data, _children=(), _op=''):
        self.data  = float(data)
        self.grad  = 0.0                   # gradient, initialised to 0
        self._prev = set(_children)
        self._op   = _op                   # operation that produced this node
        self._backward = lambda: None # default: leaf node

    def __add__(self, other):
        other  = other if isinstance(other, Value) else Value(other)
        out    = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad  += 1.0 * out.grad  # d(a+b)/da = 1
            other.grad += 1.0 * out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other  = other if isinstance(other, Value) else Value(other)
        out    = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad  += other.data * out.grad # d(a*b)/da = b
            other.grad += self.data  * out.grad # d(a*b)/db = a
        out._backward = _backward
        return out

    def sigmoid(self):
        import math
        s   = 1 / (1 + math.exp(-self.data))
        out = Value(s, (self,), 'sigmoid')
        def _backward():
            self.grad += s * (1 - s) * out.grad  # sigmoid derivative
        out._backward = _backward
        return out

    def backward(self):
        """Topological sort + backward pass."""
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev: build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0              # dL/dL = 1 (seed)
        for node in reversed(topo): node._backward()

    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    __radd__ = __add__;  __rmul__ = __mul__

Python•Using the autograd engine to train a neuron
# Train a single neuron: y_hat = sigmoid(w1*x1 + w2*x2 + b)
# Target: learn AND gate  inputs=[0,0,0,1,1,0,1,1] targets=[0,0,0,1]

import random
random.seed(42)

data    = [([0,0], 0), ([0,1], 0), ([1,0], 0), ([1,1], 1)]
w1, w2, b = Value(random.gauss(0,1)), Value(random.gauss(0,1)), Value(0.0)

for epoch in range(200):
    total_loss = Value(0.0)
    for (x1, x2), target in data:
        pred = (w1 * x1 + w2 * x2 + b).sigmoid()
        loss = (pred + (-target)).sigmoid()  # rough BCE proxy
        total_loss = total_loss + loss
    # Zero gradients, backward, update
    for p in [w1, w2, b]: p.grad = 0.0
    total_loss.backward()
    for p in [w1, w2, b]: p.data -= 0.1 * p.grad
    if epoch % 50 == 0:
        print(f"Epoch {epoch}: loss={total_loss.data:.4f} w1={w1.data:.2f} w2={w2.data:.2f}")

✧

Why Zero Gradients Before Each Step?

Gradients in PyTorch (and our engine) accumulate by default — each .backward() call adds to existing gradients. This enables gradient accumulation over multiple mini-batches (useful when GPU memory limits batch size).

But for a standard training step, you must zero the gradients first, or each step will add to all previous gradients, producing wrong updates. This is why optimizer.zero_grad() is the first line of every PyTorch training loop.

In the early days of deep learning, training networks with more than ~4 layers was nearly impossible. The culprit: gradients either shrank to zero (vanishing) or grew to infinity (exploding) during backpropagation. Understanding this is essential for understanding every architectural decision in the Transformer.

The Root Cause: Repeated Multiplication

In a network of depth L, the gradient of the loss w.r.t. the first layer involves a product of L Jacobian matrices. If each has entries slightly less than 1, the product collapses to zero. If entries are slightly above 1, it explodes.

Python•Demonstrating vanishing gradients
import numpy as np

# Simulate gradient flowing back through L sigmoid layers
# sigmoid'(x) ≤ 0.25, so each layer multiplies gradient by ≤ 0.25

def simulate_gradient_flow(L, activation_deriv=0.25):
    """L layers, each multiplying gradient by activation_deriv."""
    grad = 1.0  # gradient at output layer
    for _ in range(L): grad *= activation_deriv
    return grad

print("Gradient magnitude at input layer:")
for L in [1, 2, 5, 10, 20, 50]:
    g_sig  = simulate_gradient_flow(L, 0.25)  # sigmoid
    g_relu = simulate_gradient_flow(L, 1.0)   # ReLU (best case)
    print(f"L={L:2d}: sigmoid={g_sig:.2e}  relu={g_relu:.1f}")
# L= 1: sigmoid=2.50e-01  relu=1.0
# L= 5: sigmoid=9.77e-04  relu=1.0
# L=10: sigmoid=9.54e-07  relu=1.0
# L=20: sigmoid=9.09e-13  relu=1.0
# L=50: sigmoid=≈10^-30   relu=1.0  ← completely useless

How Transformers Avoid This Problem

The Transformer architecture uses three structural choices that keep gradients healthy throughout training in very deep networks:

Technique	Why it helps
Residual connections	x → x + F(x). The gradient of the identity path is always 1, so gradients can bypass any problematic sub-network entirely. GPT-2 has 48 residual streams.
Layer normalization	Normalizes activations to zero mean and unit variance before each sub-layer. This prevents activations from growing exponentially and keeps gradient magnitudes in a reasonable range.
Gradient clipping	If \|\|g\|\|_2 > max_norm, scale g down to max_norm. Prevents exploding gradients from causing NaN loss. Standard in all LLM training runs (typically max_norm=1.0).

Raw gradient descent works, but it is slow on ill-conditioned loss landscapes (where some directions are steep, others flat). Modern optimizers add memory of past gradients and adaptive per-parameter step sizes to accelerate convergence dramatically.

Momentum — Accumulating History

Momentum keeps a running average of past gradients. It accelerates through flat regions and dampens oscillations in narrow valleys.

v ← βv + (1 - β)g θ ← θ - ηv

Python•SGD with Momentum from scratch
import numpy as np

class SGDMomentum:
    def __init__(self, params, lr=0.01, momentum=0.9):
        self.params   = params
        self.lr       = lr
        self.momentum = momentum
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.velocity[i] = self.momentum * self.velocity[i] + (1 - self.momentum) * g
            p               -= self.lr * self.velocity[i]

Adam — Adaptive Moments

Adam (Adaptive Moment Estimation) combines momentum (first moment) with an adaptive learning rate per parameter based on the magnitude of recent gradients (second moment). It is the most widely used optimizer in deep learning.

m ← β₁m + (1-β₁)g v ← β₂v + (1-β₂)g²

θ ← θ - η × m̂ / (√v̂ + ε) where m̂ = m/(1-β₁ᵗ), v̂ = v/(1-β₂ᵗ)

Python•Adam optimizer from scratch
import numpy as np

class Adam:
    def __init__(self, params, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
        self.params = params
        self.lr     = lr
        self.b1, self.b2, self.eps = b1, b2, eps
        self.m = [np.zeros_like(p) for p in params]  # 1st moment
        self.v = [np.zeros_like(p) for p in params]  # 2nd moment
        self.t = 0                                         # step counter

    def step(self, grads):
        self.t += 1
        for i, (p, g) in enumerate(zip(self.params, grads)):
            self.m[i] = self.b1 * self.m[i] + (1 - self.b1) * g
            self.v[i] = self.b2 * self.v[i] + (1 - self.b2) * g**2
            m_hat = self.m[i] / (1 - self.b1**self.t)  # bias correction
            v_hat = self.v[i] / (1 - self.b2**self.t)
            p    -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

AdamW — Decoupled Weight Decay

Standard Adam applies L2 regularization through the gradient, which interacts badly with adaptive learning rates. AdamW decouples weight decay from the gradient update, making it more effective. This distinction matters:

Adam: g ← g + λw then update with g

AdamW: update w with g, then w ← w × (1 - ηλ)

Python•AdamW — the standard LLM training optimizer
import torch
import torch.nn as nn

# GPT-style training setup
model    = nn.Transformer(d_model=512, nhead=8)

# Standard AdamW hyperparameters for LLM training
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr          = 3e-4,   # starting LR (will be scheduled)
    betas       = (0.9, 0.95), # GPT-3 values (b2=0.95 not 0.999)
    eps         = 1e-8,
    weight_decay= 0.1,   # AdamW decoupled weight decay
)

# Training step with gradient clipping
loss = model(...).mean()          # forward pass
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # CRITICAL
optimizer.step()

Learning Rate Scheduling

A fixed learning rate is rarely optimal. Modern LLM training uses a schedule with three phases: linear warmup, cosine decay, and a floor.

Python•Cosine annealing with linear warmup
import math

def cosine_lr_with_warmup(step, warmup_steps, total_steps, max_lr, min_lr):
    """LR schedule used in GPT-3, LLaMA, and most large LLMs."""
    if step < warmup_steps:             # linear warmup phase
        return max_lr * step / warmup_steps
    if step > total_steps:             # after training (just in case)
        return min_lr
    # Cosine decay phase
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff    = 0.5 * (1 + math.cos(math.pi * progress))  # 1.0 → 0.0
    return min_lr + coeff * (max_lr - min_lr)

# Typical values for a medium-sized LLM
schedule = [cosine_lr_with_warmup(s, warmup_steps=2000, total_steps=100000,
                                     max_lr=3e-4, min_lr=3e-5)
          for s in range(100001)]
print(f"Step 0: {schedule[0]:.2e}")       # 0.00e+00 (starts at 0)
print(f"Step 2000: {schedule[2000]:.2e}")  # 3.00e-04 (peak)
print(f"Step 100k: {schedule[-1]:.2e}")   # 3.00e-05 (floor)

▶

ML Connection: The Standard LLM Training Recipe

Every major LLM (GPT-3, LLaMA, Mistral, Gemma) uses this optimization recipe: AdamW with β₁=0.9, β₂=0.95, ε=1e-8; weight decay 0.1; cosine LR schedule with ~2000-step warmup; gradient clipping at max_norm=1.0.

The choice of β₂=0.95 (instead of the default 0.999) makes the second moment adapt faster, important for the non-stationary gradients seen during LLM training on diverse data.

Every time you implement a custom backward pass, a new loss function, or a novel architecture component, you should verify the gradients numerically. Gradient checking compares analytical gradients (from backprop) against numerical estimates (from finite differences). They should agree to within ~1e-5.

Python•Code Lab 2.2 — Universal gradient checker
import numpy as np

def gradient_check(loss_fn, params, h=1e-5, tol=1e-5, verbose=True):
    """
    Compare analytical gradients to numerical (central differences).
    loss_fn: callable that takes params list → (loss, list of analytic grads)
    params:  list of numpy arrays
    """
    _, analytic_grads = loss_fn(params)
    all_ok = True

    for pi, (p, ag) in enumerate(zip(params, analytic_grads)):
        num_grad = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            p_plus        = [q.copy() for q in params]
            p_minus       = [q.copy() for q in params]
            p_plus[pi][idx]  += h
            p_minus[pi][idx] -= h
            L_plus,  _ = loss_fn(p_plus)
            L_minus, _ = loss_fn(p_minus)
            num_grad[idx] = (L_plus - L_minus) / (2 * h)

        err = np.max(np.abs(ag - num_grad) / (np.abs(ag) + np.abs(num_grad) + 1e-8))
        ok  = err < tol
        if verbose:
            status = '✓ OK' if ok else '✗ FAIL'
            print(f"Param {pi}: max_rel_err={err:.2e} {status}")
        all_ok = all_ok and ok
    return all_ok

# Example: verify gradients of a custom loss
def my_loss(params):
    W, b = params
    x    = np.array([1.0, 2.0, 3.0])
    pred = np.tanh(W @ x + b)
    loss = float(np.mean(pred**2))
    # Analytic gradients
    dtanh = 1 - pred**2             # tanh derivative
    dL_dp = 2 * pred / len(pred)
    dL_dz = dL_dp * dtanh
    dW    = np.outer(dL_dz, x)
    db    = dL_dz
    return loss, [dW, db]

W0 = np.random.randn(2, 3); b0 = np.random.randn(2)
gradient_check(my_loss, [W0, b0])  # should print ✓ OK for both

Key Formulas Reference

Concept	Formula / Rule
Derivative definition	f'(x) = lim_{h→0} [f(x+h)-f(x)] / h
Chain rule	dz/dx = (dz/dy)(dy/dx)
Gradient	∇f = [∂f/∂x₁, ..., ∂f/∂xₙ]ᵀ
Gradient descent	θ ← θ - η∇L(θ)
Sigmoid derivative	σ'(x) = σ(x)(1 - σ(x))
Adam update	θ ← θ - η m̂/(√v̂ + ε)
Gradient clipping	if \|\|g\|\|>c: g ← g×c/\|\|g\|\|
Cosine LR warmup	lr = max_lr × (step/warmup) then cosine decay

Exercises

Exercises 1–10 are pen-and-paper; 11–16 require code. Complete all before Chapter 3.

✎

Exercise 1: Pen & Paper

Find the derivative of f(x) = x³ - 5x² + 3x + 7. At which x is it zero?

✎

Exercise 2: Pen & Paper

Differentiate g(x) = e^(sin(x²)). Show all chain rule steps.

✎

Exercise 3: Pen & Paper

Given f(x,y) = x²y + y³, find ∂f/∂x, ∂f/∂y, and the gradient at (2, 1).

✎

Exercise 4: Pen & Paper

Apply the chain rule to z = log(sigmoid(wx + b)). Find ∂z/∂w.

✎

Exercise 5: Pen & Paper

Derive the backward pass for f(a, b) = (a + b) × (a - b).

✎

Exercise 6: Pen & Paper

Show that the Adam update is invariant to rescaling of gradients (multiply g by a constant c). Why is this useful?

✎

Exercise 7: Pen & Paper

If a loss L = MSE on a linear model (L = (Wx-y)²/N), derive the gradient ∂L/∂W analytically.

✎

Exercise 8: Pen & Paper

Explain why gradient vanishing is worse for sigmoid than for ReLU. Use the derivative values.

✎

Exercise 9: Pen & Paper

In AdamW, weight decay multiplies weights by (1-ηλ) each step. After T steps without gradient update, what does the weight converge to?

✎

Exercise 10: Pen & Paper

Draw the computational graph for L = (a×b + c)² and write out the backward pass equations.

✎

Exercise 11: Code

Numerically verify ∂(softmax)/∂x_i for a 5-element vector. Compare to the analytic Jacobian: J_ij = softmax_i(δ_ij - softmax_j).

✎

Exercise 12: Code Lab 2.1

Extend the scalar autograd engine with: __sub__, __pow__, log(), tanh(). Verify each with gradient_check.

✎

Exercise 13: Code

Train a 2-layer MLP on XOR (4 samples) using your autograd engine. Report final loss and learned weights.

✎

Exercise 14: Code Lab 2.2

Apply gradient_check to a custom cross-entropy loss. Confirm gradients match PyTorch’s F.cross_entropy.

✎

Exercise 15: Code

Plot the gradient norm ||g||_2 over 1000 training steps on a simple MLP. Does it grow, shrink, or stabilize? Repeat with gradient clipping.

✎

Exercise 16: Code (Challenge)

Implement a full Adam optimizer from scratch and train a linear regression model. Compare loss curves to torch.optim.Adam on the same problem.

Further reading: Andrej Karpathy’s “micrograd” repository (github.com/karpathy/micrograd) — the inspiration for Code Lab 2.1. CS231n “Backpropagation Notes” (cs231n.stanford.edu). “Numerical Optimization” by Nocedal & Wright for a rigorous treatment of Adam and second-order methods.

Next: Chapter 3 — Probability & Statistics. We reframe training as maximum likelihood estimation and see why cross-entropy is the correct loss for language modeling.

✎ 16 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 1. Linear Algebra for ML

NextCh 3. Probability & Statistics

→

Calculus & Optimization

Learning Objectives

Functions, Limits & the Derivative

What is a Derivative?

Key Derivatives Every ML Practitioner Must Know

Differentiation Rules

Product Rule

The Chain Rule — The Heart of Backpropagation

Computational Graphs — Visualizing the Chain

Partial Derivatives & Gradients

Partial Derivatives

The Gradient — All Partials in a Vector

The Jacobian — Gradients for Vector-Valued Functions

Gradient Descent

The Update Rule

Learning Rate: The Most Important Hyperparameter

Stochastic & Mini-Batch Gradient Descent

Backpropagation from First Principles

The Algorithm in Three Steps

Code Lab 2.1 — Build a Scalar Autograd Engine

Vanishing & Exploding Gradients

The Root Cause: Repeated Multiplication

How Transformers Avoid This Problem

Optimization Beyond SGD

Momentum — Accumulating History

Adam — Adaptive Moments

AdamW — Decoupled Weight Decay

Learning Rate Scheduling

Gradient Checking

Chapter Summary & Exercises

Key Formulas Reference

Exercises