Attention Mechanisms
Training a neural network means computing the gradient of a scalar loss with respect to millions or billions of parameters. The only mathematical tool required is the chain rule — but applied at this scale, it demands a systematic algorithm. That algorithm is backpropagation, and the machinery that automates it is automatic differentiation. By the end of this chapter you will have built both from scratch.
The Single-Variable Chain Rule
Recall the elementary chain rule: if y = f(u) and u = g(x), then dy/dx = (dy/du)(du/dx). Gradients multiply along a chain of dependencies. Backpropagation is this rule applied to a graph rather than a line.
Single chain: dy/dx = (dy/du)(du/dx)
Multivariate: ∂L/∂x = Σᵢ (∂L/∂uᵢ)(∂uᵢ/∂x) # sum over all paths
Vector form: ∇ₓ L = Jᵀ · ∇ᵤ L # J = Jacobian ∂u/∂xThe crucial generalization is the multivariate chain rule: when a variable x influences the loss through multiple intermediate paths, the gradients from all paths sum. This summation is the single most error-prone aspect of hand-written backprop — and the thing autograd handles automatically.
A computational graph represents a computation as nodes (variables and operations) connected by edges (data dependencies). It makes the structure of a computation explicit and is the data structure on which automatic differentiation operates.
A Worked Example
Consider the expression L = (a·b + c)². We can decompose it into elementary operations, each a node in the graph:
d = a * b # multiply node
e = d + c # add node
L = e * e # square nodeThe forward pass evaluates these in topological order (inputs first). The backward pass traverses them in reverse, applying the chain rule at each node. Here is the graph drawn as a stack from inputs (bottom) to loss (top):
Arch Stack: Computational graph for L = (a·b + c)²
| L = e · e | square |
| e = d + c | add |
| d = a · b | multiply |
| inputs: a, b, c | leaves |
Local Gradients at Each Node
Each operation knows only its own local gradient — how its output changes with its inputs. Backprop chains these together:
| Node | Forward | Local gradients |
|---|---|---|
| multiply d=a·b | d = a*b | ∂d/∂a = b, ∂d/∂b = a |
| add e=d+c | e = d+c | ∂e/∂d = 1, ∂e/∂c = 1 |
| square L=e² | L = e*e | ∂L/∂e = 2e |
Composing by the chain rule: ∂L/∂a = (∂L/∂e)(∂e/∂d)(∂d/∂a) = 2e · 1 · b = 2(a·b+c)·b. The backward pass computes exactly this product, but mechanically — each node multiplies the incoming gradient by its local gradient and passes it on.
Automatic differentiation comes in two flavors, distinguished by the direction in which they propagate derivatives through the graph. Understanding why deep learning uses reverse mode (backpropagation) reveals a deep efficiency argument.
| Forward-mode AD | Reverse-mode AD (backprop) |
|---|---|
| Propagates derivatives input → output | Propagates derivatives output → input |
| Computes ∂(all outputs)/∂(one input) | Computes ∂(one output)/∂(all inputs) |
| One pass per input variable | One pass per output variable |
| Efficient when few inputs, many outputs | Efficient when many inputs, one output |
| No need to store intermediate values | Must store forward activations |
| Used in: sensitivity analysis, Jacobians | Used in: ALL deep learning training |
Why Deep Learning Uses Reverse Mode
A neural network has one scalar output (the loss) and millions of inputs (the parameters). Forward mode would require one full pass per parameter — millions of passes. Reverse mode computes the gradient with respect to all parameters in a single backward pass. This asymmetry is decisive.
Forward mode: cost ∝ n passes (one per input)
Reverse mode: cost ∝ m passes (one per output)
Neural net training: n = 10⁹ params, m = 1 scalar loss
⇒ Reverse mode is ~10⁹× cheaper. This is why backprop won.Before automating backprop, we derive it by hand for a two-layer MLP. This grounds the autograd engine you will build next, and the vector-Jacobian products you derive here are exactly what each autograd node implements.
The Network
z₁ = X W₁ + b₁ # (N, H) pre-activation
a₁ = ReLU(z₁) # (N, H) hidden activation
z₂ = a₁ W₂ + b₂ # (N, K) logits
p = softmax(z₂) # (N, K) probabilities
L = CE(p, y) # scalar lossThe Backward Pass
We compute gradients in reverse, starting from the loss. The softmax-cross-entropy gradient (derived in Chapter 4) gives the elegant starting point ∂L/∂z₂ = (p − y)/N:
∂L/∂z₂ = (p - y) / N # (N, K) softmax-CE shortcut
∂L/∂W₂ = a₁ᵀ (∂L/∂z₂) # (H, K)
∂L/∂b₂ = Σ_rows (∂L/∂z₂) # (K,)
∂L/∂a₁ = (∂L/∂z₂) W₂ᵀ # (N, H)
∂L/∂z₁ = (∂L/∂a₁) ⊙ [z₁ > 0] # (N, H) ReLU mask
∂L/∂W₁ = Xᵀ (∂L/∂z₁) # (D, H)
∂L/∂b₁ = Σ_rows (∂L/∂z₁) # (H,)[Missing Component: shapeNote]
import numpy as np
def mlp_backward(X, y, cache):
"""y: integer labels. cache holds z1, a1, p from forward pass."""
z1, a1, p, W2 = cache
N = len(y)
# Output layer: softmax-CE gradient
dz2 = p.copy(); dz2[np.arange(N), y] -= 1; dz2 /= N # (N, K)
dW2 = a1.T @ dz2 # (H, K)
db2 = dz2.sum(0) # (K,)
# Hidden layer: backprop through ReLU
da1 = dz2 @ W2.T # (N, H)
dz1 = da1 * (z1 > 0) # ReLU mask
dW1 = X.T @ dz1 # (D, H)
db1 = dz1.sum(0) # (H,)
return dW1, db1, dW2, db2
# Verify against numerical gradient (Chapter 5's checker) before trusting it.
# This hand-derivation is what autograd will compute automatically.We now build a working autograd engine for scalars, in the spirit of Karpathy's micrograd. Each Value object holds a number, its gradient, and a reference to the function that computes its local backward pass. Operations build the graph automatically; a single backward() call propagates gradients through the whole graph.
class Value:
"""A scalar that tracks its gradient through a computational graph."""
def __init__(self, data, _children=(), _op=''):
self.data = data
self.grad = 0.0 # ∂L/∂self, accumulated
self._backward = lambda: None # local backward closure
self._prev = set(_children) # graph edges
self._op = _op
def __add__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data + other.data, (self, other), '+')
def _backward():
self.grad += out.grad # ∂(a+b)/∂a = 1
other.grad += out.grad # ∂(a+b)/∂b = 1
out._backward = _backward
return out
def __mul__(self, other):
other = other if isinstance(other, Value) else Value(other)
out = Value(self.data * other.data, (self, other), '*')
def _backward():
self.grad += other.data * out.grad # ∂(ab)/∂a = b
other.grad += self.data * out.grad # ∂(ab)/∂b = a
out._backward = _backward
return out
def relu(self):
out = Value(max(0, self.data), (self,), 'ReLU')
def _backward():
self.grad += (out.data > 0) * out.grad # 1 if active
out._backward = _backward
return out
def backward(self):
# 1. Build topological order of all nodes behind self
topo, visited = [], set()
def build(v):
if v not in visited:
visited.add(v)
for child in v._prev: build(child)
topo.append(v)
build(self)
# 2. Seed the output gradient and backprop in reverse topo order
self.grad = 1.0
for v in reversed(topo): v._backward()Using the Engine
# Compute L = (a*b + c)^2 and its gradients automatically
a = Value(2.0); b = Value(-3.0); c = Value(10.0)
d = a * b # -6
e = d + c # 4
L = e * e # 16
L.backward()
print(f"L = {L.data}") # 16.0
print(f"dL/da = {a.grad}") # 2*e*b = 2*4*(-3) = -24
print(f"dL/db = {b.grad}") # 2*e*a = 2*4*2 = 16
print(f"dL/dc = {c.grad}") # 2*e*1 = 2*4 = 8
# All match the hand-derived gradients from Section 11.2. The engine works.The backward pass cannot visit nodes in arbitrary order. A node's gradient is only complete once all nodes that depend on it have contributed. Topological sorting guarantees this: process each node only after everything downstream of it has been processed.
Why Order Matters
Consider a node x used by two operations whose results both feed the loss. The gradient ∂L/∂x is the sum of contributions from both paths. If we processed x before both contributors had run, we would propagate an incomplete gradient downstream — corrupting every node below x.
# Phase 1: build topological order via DFS post-order
topo ← []; visited ← {}
function build(v):
if v not in visited:
visited.add(v)
for child in v.prev: build(child)
topo.append(v) # post-order: children before parent
build(loss)
# Phase 2: backprop in REVERSE topological order
loss.grad ← 1.0 # seed: ∂L/∂L = 1
for v in reversed(topo):
v._backward() # all downstream grads are readyThe Tape (Wengert List)
Reverse-mode AD frameworks record operations on a 'tape' (also called a Wengert list) during the forward pass. The tape is exactly the topological order: replaying it in reverse gives the backward pass. PyTorch builds this tape implicitly through the grad_fn attribute on each tensor.
Real networks operate on tensors, not scalars. Extending autograd to tensors requires computing vector-Jacobian products (VJPs) rather than scalar derivatives, and handling the subtleties of broadcasting. The key efficiency insight: we never materialize the full Jacobian — we compute its product with the incoming gradient directly.
The Vector-Jacobian Product
For a tensor operation y = f(x), the Jacobian ∂y/∂x can be enormous (for a 1000×1000 matmul it has a trillion entries). Reverse-mode AD never builds it: it computes the VJP, the product of the upstream gradient with the Jacobian, which usually has a simple closed form.
matmul Y = X W:
∂L/∂X = (∂L/∂Y) Wᵀ ∂L/∂W = Xᵀ (∂L/∂Y)
elementwise y = f(x):
∂L/∂x = (∂L/∂y) ⊙ f'(x)
sum s = Σ x:
∂L/∂x = (∂L/∂s) broadcast to shape of xBroadcasting and Gradient Reduction
When an operation broadcasts a smaller tensor against a larger one (e.g., adding a bias vector to a batch), the backward pass must sum the gradient over the broadcasted dimensions to restore the original shape. Forgetting this is a frequent source of shape-mismatch bugs.
import numpy as np
class Tensor:
def __init__(self, data, _children=(), _op=''):
self.data = np.asarray(data, dtype=float)
self.grad = np.zeros_like(self.data)
self._backward = lambda: None
self._prev = set(_children)
def __matmul__(self, other):
out = Tensor(self.data @ other.data, (self, other), '@')
def _backward():
self.grad += out.grad @ other.data.T # (∂L/∂Y) Wᵀ
other.grad += self.data.T @ out.grad # Xᵀ (∂L/∂Y)
out._backward = _backward
return out
def __add__(self, other):
out = Tensor(self.data + other.data, (self, other), '+')
def _backward():
self.grad += _unbroadcast(out.grad, self.data.shape)
other.grad += _unbroadcast(out.grad, other.data.shape)
out._backward = _backward
return out
def _unbroadcast(grad, shape):
"""Sum grad over broadcasted dims to match `shape`."""
while grad.ndim > len(shape): grad = grad.sum(0) # extra leading dims
for i, s in enumerate(shape):
if s == 1: grad = grad.sum(i, keepdims=True) # broadcast dim
return grad[Missing Component: shapeNote]
Two operations have backward passes subtle enough to derive carefully: softmax (where every output depends on every input) and LayerNorm (where the normalization couples all features). Both appear in every Transformer, and both are where hand-written backprop most often goes wrong.
Softmax Backward
The softmax Jacobian is dense: ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ). But the VJP — the product with the upstream gradient — simplifies beautifully:
p = softmax(z)
∂L/∂z = p ⊙ (∂L/∂p - Σⱼ (∂L/∂p)ⱼ pⱼ)
# In words: subtract the p-weighted mean of the upstream gradient,
# then scale by p. One pass, no Jacobian materialized.import numpy as np
def softmax_backward(dout, p):
"""dout: ∂L/∂p (N,K). p: softmax output (N,K). Returns ∂L/∂z."""
# Per-row: dz = p * (dout - sum(dout * p))
weighted = (dout * p).sum(axis=1, keepdims=True)
return p * (dout - weighted)
def layernorm_backward(dout, x, gamma, eps=1e-5):
"""Backward through y = gamma * (x - mu)/sqrt(var+eps)."""
N, D = x.shape
mu = x.mean(axis=1, keepdims=True)
var = x.var(axis=1, keepdims=True)
std = np.sqrt(var + eps)
xhat = (x - mu) / std
dgamma = (dout * xhat).sum(0)
dbeta = dout.sum(0)
# Gradient w.r.t. x (the coupled part -- derive carefully!)
dxhat = dout * gamma
dx = (1.0/D) / std * (
D * dxhat
- dxhat.sum(1, keepdims=True)
- xhat * (dxhat * xhat).sum(1, keepdims=True)
)
return dx, dgamma, dbeta
# ALWAYS verify these against the numerical gradient checker (Chapter 5).
# The LayerNorm backward is the single most bug-prone derivation in
# Transformer implementations -- the coupling through mu and var is subtle.Reverse-mode AD has a hidden cost: it must store every intermediate activation from the forward pass to compute gradients in the backward pass. For deep Transformers with long sequences, these stored activations dominate memory — often exceeding the memory for the parameters themselves.
The Activation Memory Problem
A Transformer with L layers stores activations for all L layers during the forward pass, holding them until the backward pass consumes them. Activation memory scales as O(L × batch × seq_len × d) — and for long contexts this becomes the binding constraint on model size.
| Strategy | Memory | Compute |
|---|---|---|
| Store all activations | O(L) — full | 1× forward (baseline) |
| Gradient checkpointing | O(√L) — sublinear | ~1.33× forward (recompute) |
| Recompute everything | O(1) — minimal | 2× forward (recompute all) |
Gradient checkpointing (also called activation checkpointing) stores activations only at a subset of layers — the checkpoints. During the backward pass, the activations between checkpoints are recomputed on the fly from the nearest checkpoint. Storing checkpoints every √L layers gives O(√L) memory at the cost of one extra forward pass per segment.
import torch
from torch.utils.checkpoint import checkpoint
# Without checkpointing: stores all activations (high memory)
def forward_normal(self, x):
for block in self.blocks:
x = block(x) # every activation kept for backward
return x
# With checkpointing: recompute activations during backward (low memory)
def forward_checkpointed(self, x):
for block in self.blocks:
x = checkpoint(block, x, use_reentrant=False)
# block's internal activations are NOT stored;
# they are recomputed when backward reaches this block
return x
# Trade-off: ~33% more compute for training, but enables much larger
# models / longer sequences on the same GPU. Standard for large LLM training.Backprop Quick-Reference
| Operation | Forward | VJP (backward) |
|---|---|---|
| add y=a+b | a + b | da = dy, db = dy (unbroadcast) |
| mul y=a⊙b | a * b | da = dy⊙b, db = dy⊙a |
| matmul Y=XW | X @ W | dX = dY@Wᵀ, dW = Xᵀ@dY |
| relu y=relu(x) | max(0,x) | dx = dy ⊙ [x>0] |
| softmax p | softmax(z) | dz = p⊙(dp - Σ(dp⊙p)) |
| sum s=Σx | x.sum() | dx = ds broadcast to x.shape |
| layernorm | (x-μ)/σ·γ+β | coupled — see Section 11.8 |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–18 require code.
Further reading: Andrej Karpathy's “micrograd” and his lecture “The spelled-out intro to neural networks and backpropagation” — the inspiration for this chapter's scalar engine. “Automatic Differentiation in Machine Learning: a Survey” (Baydin et al., 2018) for the full taxonomy. The PyTorch autograd documentation and the “Autograd mechanics” note. “Training Deep Nets with Sublinear Memory Cost” (Chen et al., 2016) for gradient checkpointing.
Next → Chapter 12: Attention Mechanisms
You can now compute gradients through any computational graph — the engine that trains every neural network. Chapter 12 builds the one component missing from your Transformer block: attention. We will start from the query-key-value memory lookup you met in Chapter 9, derive scaled dot-product attention from first principles, and build multi-head self-attention from scratch. This is the mechanism that defines the Transformer, and with your autograd understanding you will see exactly how gradients flow through it.