Part II: Classical ML & Representations
Chapter 10

Tokenization

BPE, WordPiece, SentencePiece — the vocabulary problem
20 Exercises
10.1

Every neural network, including the largest Transformers, is built from a single repeated unit: the artificial neuron. Its simplest form, the perceptron (Rosenblatt, 1958), computes a weighted sum of its inputs, adds a bias, and applies a step function. Understanding its capabilities — and its famous failure — is the foundation for everything that follows.

The Perceptron Model

y = step(w · x + b) where step(z) = 1 if z ≥ 0, else 0

The perceptron is a linear classifier: w · x + b = 0 defines a hyperplane, and the step function assigns each side to a class. It has the same linear form as the logistic regression of Chapter 6, with a hard threshold replacing the sigmoid.

History: The Perceptron Hype and the AI Winter
Rosenblatt's perceptron generated enormous excitement — the New York Times reported in 1958 that the Navy expected it to 'walk, talk, see, write, reproduce itself and be conscious of its existence.'
In 1969, Minsky and Papert proved the single-layer perceptron cannot learn the XOR function. This result, widely (mis)interpreted as a fatal limitation of neural networks, contributed to the first 'AI winter' — a decade of reduced funding. The irony: multi-layer perceptrons solve XOR easily, but no one knew how to train them until backpropagation was popularised in 1986.

The XOR Problem

XOR (exclusive or) outputs 1 when exactly one input is 1. The four points (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0 cannot be separated by any single straight line — they are not linearly separable. A single perceptron fails. This is the limitation that motivates depth.

Python: The perceptron and its XOR failure
import numpy as np

class Perceptron:
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim); self.b = 0.0; self.lr = lr

    def predict(self, x): return int(x @ self.w + self.b >= 0)

    def fit(self, X, y, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                pred = self.predict(xi)
                # Perceptron learning rule: update on mistakes
                self.w += self.lr * (yi - pred) * xi
                self.b += self.lr * (yi - pred)

# Linearly separable AND gate: works perfectly
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y_and = np.array([0,0,0,1])
p = Perceptron(2); p.fit(X, y_and)
print("AND:", [p.predict(x) for x in X])  # [0, 0, 0, 1] correct

# XOR: never converges -- not linearly separable
y_xor = np.array([0,1,1,0])
p2 = Perceptron(2); p2.fit(X, y_xor, epochs=1000)
print("XOR:", [p2.predict(x) for x in X])  # wrong, no matter how long
# A single perceptron CANNOT represent XOR. We need a hidden layer.
10.2

The fix for XOR — and for every non-linearly-separable problem — is to stack layers of neurons, separated by nonlinear activation functions. A multi-layer perceptron (MLP) transforms its input through a sequence of linear projections and nonlinearities, learning a representation in which the problem becomes separable.

The MLP Forward Pass

Text: Two-layer MLP
h  =  σ(W₁ x + b₁)        # hidden layer  (D → H)
y  =  W₂ h + b₂           # output layer  (H → K)

σ = nonlinear activation (ReLU, GELU, …)

The nonlinearity σ is essential. Without it, stacking linear layers collapses to a single linear layer: W₂(W₁ x) = (W₂ W₁) x. The nonlinearity is what gives depth its power — each layer can warp the space so the next layer's linear boundary becomes useful.
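The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the matrices W1 and W2, the input x, and the helper names matvec and matmul are arbitrary illustrative choices, not part of the chapter's code.

```python
# Two stacked linear maps equal one linear map; a ReLU in between breaks the equality.
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrices A (p x q) and B (q x r)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, a) for a in v]

W1 = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]   # hypothetical 2 -> 3 layer
W2 = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]     # hypothetical 3 -> 2 layer
x = [0.7, -1.2]

stacked = matvec(W2, matvec(W1, x))           # W2 (W1 x)
collapsed = matvec(matmul(W2, W1), x)         # (W2 W1) x
with_relu = matvec(W2, relu(matvec(W1, x)))   # W2 ReLU(W1 x)

print("stacked:  ", stacked)
print("collapsed:", collapsed)    # matches stacked up to rounding
print("with ReLU:", with_relu)    # differs: no single matrix reproduces this for all x
```

With the nonlinearity in place, the second output component changes because ReLU zeroes the negative hidden unit before W2 is applied.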

Intuition: Why Depth Solves XOR
The hidden layer transforms the four XOR points into a new space. One hidden neuron can learn 'OR' and another can learn 'AND'; the output neuron then computes 'OR AND NOT-AND' = XOR. The hidden representation makes the problem linearly separable.
This is the master pattern of all deep learning: early layers learn a representation in which the final layer's simple (often linear) decision becomes easy. A 100-layer Transformer is this same idea, scaled up enormously.
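The OR / AND construction can be written out by hand. The sketch below uses step units as in Section 10.1; the specific thresholds (0.5 and 1.5) and the function name xor_mlp are illustrative assumptions, and many other weight choices work equally well.

```python
def step(z):
    return 1 if z >= 0 else 0

def hidden(x1, x2):
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return h_or, h_and

def xor_mlp(x1, x2):
    h_or, h_and = hidden(x1, x2)
    return step(h_or - h_and - 0.5)   # OR AND NOT-AND

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
hidden_repr = [hidden(a, b) for a, b in points]
outputs = [xor_mlp(a, b) for a, b in points]
print(hidden_repr)  # [(0, 0), (1, 0), (1, 0), (1, 1)]
print(outputs)      # [0, 1, 1, 0]
```

In the hidden space the two positive examples land on the same point (1, 0), and a single line (h_or − h_and = 0.5) separates it from (0, 0) and (1, 1).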

The Universal Approximation Theorem

Cybenko (1989) and Hornik (1991) proved that an MLP with a single hidden layer and a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy. This is the theoretical guarantee that neural networks are, in principle, expressive enough to represent such functions.

⚠️
The Universal Approximation Caveat
Universal approximation says a wide-enough shallow network CAN represent any function — it does not say such a network is easy to find by gradient descent, nor that the required width is practical. A shallow network might need exponentially many neurons where a deep network needs only polynomially many (Telgarsky, 2016).
This is the formal justification for depth: deep networks represent certain functions exponentially more efficiently than shallow ones. Depth is not just expressiveness — it is efficiency.
Python: MLP solving XOR from scratch
import numpy as np

np.random.seed(0)
X = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y = np.array([[0],[1],[1],[0]], dtype=float)

# 2 -> 4 -> 1 MLP with ReLU hidden and sigmoid output
W1 = np.random.randn(2, 4) * 0.5; b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.5; b2 = np.zeros(1)

def sigmoid(z): return 1 / (1 + np.exp(-z))

for epoch in range(5000):
    # Forward
    h    = np.maximum(0, X @ W1 + b1)       # ReLU hidden
    out  = sigmoid(h @ W2 + b2)              # sigmoid output
    # Backward (binary cross-entropy)
    dout = (out - y) / len(X)                # dL/dz: gradient w.r.t. the pre-sigmoid logits
    dW2  = h.T @ dout;  db2 = dout.sum(0)
    dh   = (dout @ W2.T) * (h > 0)         # ReLU gradient
    dW1  = X.T @ dh;   db1 = dh.sum(0)
    # Update
    for p, g in [(W1,dW1),(b1,db1),(W2,dW2),(b2,db2)]: p -= 0.5 * g

preds = (sigmoid(np.maximum(0, X@W1+b1)@W2+b2) > 0.5).astype(int)
print("XOR solved:", preds.ravel())  # [0 1 1 0] -- correct!
# The hidden layer made XOR linearly separable for the output neuron.
10.3

The activation function is the single most consequential architectural choice after depth. It determines gradient flow, training stability, and representational capacity. The history of deep learning is partly a history of better activation functions.

Activation | Formula | Range | Used in
Sigmoid | 1/(1+e⁻ˣ) | (0,1) | Output gates, binary classification head
Tanh | (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ) | (-1,1) | Older RNNs, LSTM candidate/cell activations
ReLU | max(0, x) | [0,∞) | CNNs, default hidden activation since 2012
Leaky ReLU | max(0.01x, x) | (-∞,∞) | Avoids dead neurons
GELU | x·Φ(x) | (-0.17,∞) | BERT, GPT-2/3, most Transformers
SiLU/Swish | x·σ(x) | (-0.28,∞) | EfficientNet, some LLMs
SwiGLU | Swish(xW)⊙(xV) | (-∞,∞) | LLaMA, PaLM, modern LLM FFN

The ReLU Revolution

ReLU (Nair & Hinton, 2010; popularised by AlexNet, 2012) replaced sigmoid/tanh in hidden layers and made deep networks much easier to train. Its derivative is 1 for positive inputs, so gradients pass through active units unattenuated, which mitigates the vanishing gradient problem of Chapter 9. It is also trivially cheap to compute.

Text: ReLU and its gradient
ReLU(x)  =  max(0, x)
ReLU'(x) =  1 if x > 0,  else 0      # no attenuation for active units
⚠️
Pitfall: The Dying ReLU Problem
A ReLU neuron whose pre-activation is always negative outputs 0 and has zero gradient — it can never recover and is effectively dead. With a bad initialization or too-high learning rate, a large fraction of neurons can die, permanently reducing model capacity.
Fixes: Leaky ReLU (small negative slope keeps gradient alive), careful initialization (He init), and lower learning rates. GELU also avoids the hard zero, smoothly approaching zero for negative inputs.

GELU and SwiGLU: The Transformer Activations

Modern Transformers use smooth activations. GELU (Gaussian Error Linear Unit) weights each input x by Φ(x), the probability that a standard normal variable falls below x, giving a smooth, differentiable-everywhere alternative to ReLU. SwiGLU, a gated variant, powers the feed-forward networks of LLaMA and PaLM.

Text: GELU and SwiGLU
GELU(x)   =  x · Φ(x)  ≈  0.5x(1 + tanh[√(2/π)(x + 0.044715x³)])

SwiGLU(x) =  Swish(xW + b) ⊙ (xV + c)     # gated: two matrices here, plus the FFN's output projection makes three
Swish(x)  =  x · σ(βx)                    # σ = sigmoid here
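To see how close the tanh form is to the exact definition, the sketch below compares the two using only the standard library. It relies on the identity Φ(x) = ½(1 + erf(x/√2)); the function names and the evaluation grid are illustrative choices.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi written via the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation from the formula above
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

xs = [i / 100.0 for i in range(-500, 501)]          # grid on [-5, 5]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
min_val = min(gelu_exact(x) for x in xs)

print(f"max |exact - tanh approx| on [-5, 5]: {max_err:.2e}")
print(f"minimum of GELU on the grid: {min_val:.3f}")   # close to the -0.17 lower bound in the table
```

The approximation error stays small across the grid, which is why the tanh form is a reasonable drop-in when erf is unavailable or slow.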
Python: All activation functions and their gradients
import numpy as np

def sigmoid(x):  s = 1/(1+np.exp(-x)); return s, s*(1-s)
def tanh(x):     t = np.tanh(x);       return t, 1-t**2
def relu(x):                          return np.maximum(0,x), (x>0).astype(float)
def leaky_relu(x, a=0.01):       return np.where(x>0,x,a*x), np.where(x>0,1,a)

def gelu(x):  # tanh approximation (used in GPT-2)
    c = np.sqrt(2/np.pi)
    inner = c*(x + 0.044715*x**3)
    val = 0.5*x*(1+np.tanh(inner))
    # derivative omitted for brevity; use autograd in practice
    return val

def swish(x, beta=1.0):  return x*sigmoid(beta*x)[0]

# Key property: max gradient of each activation
x = np.linspace(-5, 5, 1000)
print(f"sigmoid max grad: {sigmoid(x)[1].max():.3f}")  # 0.25 -> vanishing
print(f"tanh    max grad: {tanh(x)[1].max():.3f}")     # 1.00 -> better
print(f"relu    max grad: {relu(x)[1].max():.3f}")     # 1.00 -> no attenuation
ML Connection: Why LLaMA Uses SwiGLU
The feed-forward network in each Transformer block is where most parameters live (often 2/3 of the model). LLaMA and PaLM replaced the standard ReLU/GELU FFN with SwiGLU, which empirically improves perplexity at equal parameter count.
SwiGLU uses three weight matrices instead of two; to keep parameter count constant the hidden dimension is reduced from 4d to (8/3)d. You will implement this exact FFN in Chapter 13.
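The (8/3)d figure follows from equating 2·d·4d with 3·d·h. A short arithmetic sketch, ignoring biases and using a hypothetical model width d = 4096:

```python
def ffn_params(d, hidden):
    """Standard FFN: two matrices, d -> hidden and hidden -> d (biases ignored)."""
    return 2 * d * hidden

def swiglu_ffn_params(d, hidden):
    """SwiGLU FFN: W and V (d -> hidden) plus the output projection (hidden -> d)."""
    return 3 * d * hidden

d = 4096                       # hypothetical model width
standard = ffn_params(d, 4 * d)
h = (8 * d) // 3               # parameter-matched hidden width, rounded down
gated = swiglu_ffn_params(d, h)

print(f"standard FFN (hidden=4d):   {standard:,}")   # 134,217,728
print(f"SwiGLU FFN   (hidden={h}): {gated:,}")       # 134,209,536
```

The two counts agree to within rounding of the hidden width. Real implementations often round h to a hardware-friendly multiple, so the exact value in a given model may differ slightly.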
10.4

Before training begins, the weights must be set to something. This choice is not a minor detail: initialize too large and activations explode; too small and they vanish; both make the network untrainable. Proper initialization keeps the variance of activations and gradients stable across all layers.

The Variance Propagation Argument

Consider a linear layer y = Wx with fan-in n. If the weights and inputs are independent and zero-mean, with each weight having variance Var(W) and each input variance Var(x), then each output has variance n·Var(W)·Var(x). To keep Var(y) = Var(x), we need Var(W) = 1/n. This is the core insight behind all principled initialization schemes.

Text: Variance-preserving initialization
Var(y_i) = n · Var(W) · Var(x)        # forward pass variance

Xavier/Glorot:  Var(W) = 2/(n_in + n_out)    # balances fwd & bwd, for tanh
He/Kaiming:     Var(W) = 2/n_in              # accounts for ReLU zeroing half

Xavier initialization (Glorot & Bengio, 2010) balances forward and backward variance for symmetric activations like tanh. He initialization (He et al., 2015) adds a factor of 2 to compensate for ReLU setting half the activations to zero, halving the variance.
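The relation Var(y) = n·Var(W)·Var(x) can be checked by simulation. This is a Monte Carlo sketch for a single linear output with independent, zero-mean Gaussian weights and inputs; the function name, sample count, and seed are arbitrary assumptions.

```python
import random
import statistics

def output_variance(n, var_w, var_x=1.0, samples=4000, seed=0):
    """Estimate Var(y) for y = sum_j W_j x_j with fresh W and x on every sample."""
    rng = random.Random(seed)
    sd_w, sd_x = var_w ** 0.5, var_x ** 0.5
    ys = []
    for _ in range(samples):
        y = sum(rng.gauss(0.0, sd_w) * rng.gauss(0.0, sd_x) for _ in range(n))
        ys.append(y)
    return statistics.pvariance(ys)

n = 64
preserved = output_variance(n, var_w=1.0 / n)   # theory: n * (1/n) * 1 = 1
inflated = output_variance(n, var_w=4.0 / n)    # theory: n * (4/n) * 1 = 4

print(f"Var(W)=1/n -> Var(y) ~ {preserved:.2f}")
print(f"Var(W)=4/n -> Var(y) ~ {inflated:.2f}")
```

With Var(W) = 1/n the output variance stays near the input variance; a weight variance four times larger multiplies it by four at every layer, which compounds quickly with depth.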

Python: Initialization schemes and their effect on signal propagation
import numpy as np

def test_init(init_fn, depth=50, width=256):
    """Pass a signal through `depth` ReLU layers; track activation std."""
    x = np.random.randn(1000, width)  # batch of 1000
    stds = [x.std()]
    for _ in range(depth):
        W = init_fn(width, width)
        x = np.maximum(0, x @ W)       # ReLU layer
        stds.append(x.std())
    return stds

# Too small: signal vanishes
tiny  = lambda i,o: np.random.randn(i,o) * 0.01
# Too large: signal explodes
huge  = lambda i,o: np.random.randn(i,o) * 1.0
# He init: signal preserved
he    = lambda i,o: np.random.randn(i,o) * np.sqrt(2.0/i)

print(f"tiny: layer 50 std = {test_init(tiny)[-1]:.2e}")  # on the order of 1e-48: vanished
print(f"huge: layer 50 std = {test_init(huge)[-1]:.2e}")  # on the order of 1e+52: exploded
print(f"He:   layer 50 std = {test_init(he)[-1]:.2e}")    # ~0.8: stable
Train Note: Transformer Initialization in Practice
GPT-2 initializes weights from a normal distribution with mean 0 and standard deviation 0.02, and scales residual-projection weights by 1/√(2N), where N is the number of layers, preventing residual-stream variance from growing with depth.
Modern LLMs often use 'mu-parametrization' (μP) to make optimal hyperparameters transfer across model scales, and many use small fixed std (0.006–0.02). Initialization remains an active area: small changes measurably affect training stability at scale.
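As a quick numerical illustration of the depth-scaled rule described above, the sketch below computes the resulting standard deviation for a few depths. The function name and the depths chosen are hypothetical; the formula is the 0.02 · 1/√(2N) scaling stated in this note.

```python
import math

def residual_proj_std(n_layers, base_std=0.02):
    """Std for residual-projection weights under the 1/sqrt(2N) scaling."""
    return base_std / math.sqrt(2 * n_layers)

for n_layers in (12, 24, 48):
    print(f"N={n_layers:2d}: std = {residual_proj_std(n_layers):.5f}")
# N=12: std = 0.00408
# N=24: std = 0.00289
# N=48: std = 0.00204
```

Quadrupling the depth halves the standard deviation, so the summed contribution of all residual branches keeps roughly the same variance.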
10.5

Even with good initialization, the distribution of activations shifts during training as weights update, a phenomenon the BatchNorm authors called internal covariate shift. Normalization layers re-center and re-scale activations on the fly, dramatically accelerating and stabilising training. The choice of normalization is one of the defining differences between CNN-era and Transformer-era architectures.

Batch Normalization

BatchNorm (Ioffe & Szegedy, 2015) normalizes each feature across the batch dimension, then applies a learned scale γ and shift β. It revolutionised CNN training but has a critical weakness for sequence models: it couples examples in a batch and behaves differently at train vs. inference time.

Text: Batch Normalization (per feature, across batch)
μ_B = (1/m) Σᵢ xᵢ           # batch mean
σ²_B = (1/m) Σᵢ (xᵢ - μ_B)²   # batch variance
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)  # normalize
yᵢ = γ x̂ᵢ + β              # learned scale & shift

Layer Normalization

LayerNorm (Ba et al., 2016) normalizes across the feature dimension for each example independently. This makes it independent of batch size and identical at train and inference time, which is exactly what sequence models need. Nearly every Transformer uses LayerNorm (or its RMS variant).

BatchNorm | LayerNorm
Normalizes across the batch dimension | Normalizes across the feature dimension
Couples examples in a batch | Each example normalized independently
Different behaviour train vs. inference | Identical at train and inference
Breaks with batch size 1 or variable length | Works with any batch size or length
Dominant in CNNs (ResNet, etc.) | Dominant in Transformers
Needs running statistics for inference | No running statistics needed
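The train/inference difference in the table can be made concrete with a small sketch. The class below is hypothetical (its name, the momentum value of 0.1, and the exponential-moving-average update convention are assumptions), and it omits the learned γ and β to keep the focus on the statistics.

```python
class BatchNorm1D:
    """Per-feature batch norm with running statistics (gamma/beta omitted)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = [0.0] * num_features
        self.running_var = [1.0] * num_features
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training):
        n_feat = len(self.running_mean)
        if training:
            m = len(batch)
            mean = [sum(row[j] for row in batch) / m for j in range(n_feat)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / m for j in range(n_feat)]
            for j in range(n_feat):   # exponential moving average of batch statistics
                self.running_mean[j] += self.momentum * (mean[j] - self.running_mean[j])
                self.running_var[j] += self.momentum * (var[j] - self.running_var[j])
        else:                         # inference: use stored statistics only
            mean, var = self.running_mean, self.running_var
        return [[(row[j] - mean[j]) / (var[j] + self.eps) ** 0.5 for j in range(n_feat)]
                for row in batch]

batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
bn = BatchNorm1D(2)
train_out = bn(batch, training=True)
eval_out = bn([batch[0]], training=False)     # same example, different result
single = BatchNorm1D(2)([[1.0, 10.0]], training=True)

print("first example, training: ", [round(v, 3) for v in train_out[0]])
print("first example, inference:", [round(v, 3) for v in eval_out[0]])
print("batch of one, training:  ", single[0])   # all zeros: the signal is erased
```

The same input row is mapped to different outputs in the two modes, and a training batch of size one collapses to zeros because its batch variance is zero. LayerNorm has neither issue, since its statistics come from the example itself.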

RMSNorm: The Modern Simplification

RMSNorm (Zhang & Sennrich, 2019) drops the mean-centering of LayerNorm, normalizing only by the root-mean-square of the activations. It is cheaper, and works as well or better in practice. LLaMA, Gemma, and most recent LLMs use RMSNorm.

Text: LayerNorm vs RMSNorm
LayerNorm:  y = γ · (x - μ) / √(σ² + ε) + β     # center AND scale
RMSNorm:    y = γ · x / √(mean(x²) + ε)          # scale only, no centering
Python: All three normalizations from scratch
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):  # x: (batch, features)
    mu  = x.mean(axis=0, keepdims=True)       # across BATCH
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):  # x: (batch, features)
    mu  = x.mean(axis=1, keepdims=True)       # across FEATURES
    var = x.var(axis=1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):  # no mean-centering, no beta
    rms = np.sqrt((x**2).mean(axis=1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(4, 768) * 10 + 5  # large-mean activations
g = np.ones(768); b = np.zeros(768)

ln = layer_norm(x, g, b)
print(f"LayerNorm: per-row mean={ln.mean(1).mean():.4f}, std={ln.std(1).mean():.4f}")
# LayerNorm: per-row mean=0.0000, std=1.0000

rn = rms_norm(x, g)
print(f"RMSNorm:   per-row RMS={np.sqrt((rn**2).mean(1)).mean():.4f}")
# RMSNorm:   per-row RMS=1.0000 (but mean is NOT zeroed)
ML Connection: Pre-LN vs Post-LN Transformers
Where you place LayerNorm matters enormously. The original Transformer used Post-LN (norm AFTER the residual add); modern LLMs use Pre-LN (norm BEFORE the sublayer). Pre-LN keeps the residual stream clean, making very deep Transformers trainable without learning-rate warmup tricks.
You will see in Chapter 13 that the Pre-LN block computes: x + Attention(LN(x)) and x + FFN(LN(x)). This single placement decision was the difference between Transformers that train stably at 100+ layers and ones that diverge.
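The two placements differ only in where the normalization sits relative to the residual add. The sketch below contrasts them on a single vector; the sublayer function is a hypothetical stand-in for attention or the FFN (here just a scaling by 0.5), and the learned γ, β are omitted.

```python
import statistics

def layer_norm(v, eps=1e-5):
    mu = statistics.fmean(v)
    var = statistics.pvariance(v, mu)
    return [(a - mu) / (var + eps) ** 0.5 for a in v]

def sublayer(v):
    """Hypothetical placeholder for attention or the feed-forward network."""
    return [0.5 * a for a in v]

def pre_ln_block(x):
    # x + Sublayer(LN(x)): the identity path is never normalized
    return [a + s for a, s in zip(x, sublayer(layer_norm(x)))]

def post_ln_block(x):
    # LN(x + Sublayer(x)): the residual sum itself is normalized
    return layer_norm([a + s for a, s in zip(x, sublayer(x))])

x = [4.0, -2.0, 10.0, 0.0]
pre, post = x, x
for _ in range(6):                     # stack six blocks of each kind
    pre, post = pre_ln_block(pre), post_ln_block(post)

print("Pre-LN  mean:", round(statistics.fmean(pre), 4))    # input mean (3.0) survives
print("Post-LN mean:", round(statistics.fmean(post), 4))   # re-centered at every block
print("Post-LN var: ", round(statistics.pvariance(post), 4))
```

In the Pre-LN stack the output is the original input plus a sum of sublayer contributions, so an untouched identity path runs from input to output. In the Post-LN stack every block re-normalizes that path.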
10.6

Dropout (Srivastava et al., 2014) is a regularization technique that randomly sets a fraction p of activations to zero during each training step. This prevents neurons from co-adapting — relying on specific other neurons — and forces redundant, robust representations. At inference, dropout is disabled and activations are used in full.

Inverted Dropout

The standard implementation is inverted dropout: during training, surviving activations are scaled up by 1/(1-p) so the expected value is preserved. This lets inference run with no scaling at all. Forgetting the training-time scaling is a common source of dropout bugs.

Text: Inverted dropout (training time)
mask ~ Bernoulli(1 - p)           # 1 with prob (1-p), else 0
y = (x ⊙ mask) / (1 - p)           # zero some, scale up the rest

# At inference: y = x  (no mask, no scaling)
Python: Inverted dropout from scratch
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout. p = fraction to drop."""
    if not training or p == 0:
        return x                          # inference: identity
    mask = (np.random.rand(*x.shape) > p)  # keep with prob (1-p)
    return x * mask / (1 - p)           # zero & scale up survivors

# Expected value is preserved
x = np.ones(10000)
y = dropout(x, p=0.5, training=True)
print(f"Input mean:  {x.mean():.3f}")   # 1.000
print(f"Dropout mean: {y.mean():.3f}")  # ~1.000 (preserved by 1/(1-p) scaling)
print(f"Fraction zeroed: {(y==0).mean():.3f}")  # ~0.500
Intuition: Dropout as Ensemble Averaging
Each training step uses a different random subnetwork (different dropout mask). A network with n neurons has 2^n possible subnetworks, and dropout trains an exponentially large ensemble of them with shared weights.
At inference, using all neurons with no mask (the 1/(1-p) scaling was already applied during training) approximates averaging this entire ensemble. It is a cheap approximation to model averaging, which is one of the most reliable ways to improve generalization.
Train Note: Dropout in Transformers
Transformers apply dropout in several places: after attention softmax (attention dropout), after each sublayer's output before the residual add (residual dropout), and on the embedding sum. Typical rates are 0.1 for large models, higher (0.3) for smaller models on smaller datasets.
The largest modern LLMs often use little or no dropout during pretraining — with internet-scale data, overfitting is less of a concern than underfitting, and the regularization comes from data diversity instead.
10.7

The loss function defines what the network optimizes. Chapter 4 derived cross-entropy from information theory; here we connect it to the network's output layer and survey the practical loss functions you will use.

Task | Output layer | Loss function
Binary classification | 1 unit + sigmoid | Binary cross-entropy
Multiclass classification | K units + softmax | Categorical cross-entropy
Language modeling | V units + softmax | Cross-entropy over vocabulary
Regression | 1+ linear units | Mean squared error (MSE)
Robust regression | 1+ linear units | Huber / mean absolute error
Multi-label | K units + sigmoid | Sum of binary cross-entropies
Contrastive / retrieval | Normalized embeddings | InfoNCE / triplet loss

The pairing of softmax output with cross-entropy loss is special: as shown in Chapter 4, their combined gradient simplifies to the elegant (prediction − target). This is why nearly every classifier and language model uses this exact pairing — it is numerically stable and computationally trivial.
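The (prediction − target) form is easy to verify with a finite-difference check. The sketch below is a minimal version for one example; the logits, the target index, and the step size h are arbitrary illustrative values.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(z, target):
    return -math.log(softmax(z)[target])

z, target = [2.0, -1.0, 0.5], 0                  # hypothetical logits and label

# Analytic gradient w.r.t. the logits: softmax(z) - one_hot(target)
analytic = [p - (1.0 if k == target else 0.0) for k, p in enumerate(softmax(z))]

# Central finite differences
h = 1e-6
numeric = []
for k in range(len(z)):
    zp = list(z); zp[k] += h
    zm = list(z); zm[k] -= h
    numeric.append((cross_entropy(zp, target) - cross_entropy(zm, target)) / (2 * h))

for a, n in zip(analytic, numeric):
    print(f"analytic={a:+.6f}  numeric={n:+.6f}")
```

The two columns agree to several decimal places, and the analytic gradient sums to zero because both the probabilities and the one-hot target sum to one.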

ML Connection: The LM Loss Is Just Cross-Entropy
A language model's training loss is categorical cross-entropy applied at every token position: for each position, the softmax output over the V-token vocabulary is compared to the one-hot true next token. The total loss is the mean over all positions.
This single loss, next-token cross-entropy, is the pretraining objective for GPT, LLaMA, Claude, and other autoregressive LLMs. Much of the capability of these models emerges from minimizing this one simple objective at scale.
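A toy version of the per-position loss, assuming a hypothetical 5-token vocabulary and uniform logits. Perplexity, the exponential of the mean loss, is included as the usual reporting convention; the function names are illustrative.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))   # log-sum-exp
    return [v - lse for v in z]

def lm_loss(logits_per_position, next_tokens):
    """Mean negative log-likelihood of the true next token over all positions."""
    nll = [-log_softmax(z)[t] for z, t in zip(logits_per_position, next_tokens)]
    return sum(nll) / len(nll)

V = 5                                         # hypothetical vocabulary size
uniform_logits = [[0.0] * V for _ in range(3)]
loss = lm_loss(uniform_logits, [1, 4, 2])
perplexity = math.exp(loss)

print(f"loss = {loss:.4f}  (ln V = {math.log(V):.4f})")
print(f"perplexity = {perplexity:.2f}")       # 5.00: as uncertain as a fair 5-way guess
```

A model that predicts uniformly starts at a loss of ln V, which makes a handy sanity check: an untrained LM whose initial loss is far from ln V likely has an initialization or labeling problem.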
10.8

We now assemble everything — linear layers, activations, initialization, normalization, dropout, and loss — into a complete, trainable MLP. This is a miniature version of the architecture you will scale up to the Transformer.

Python: A complete MLP class from scratch
import numpy as np

class MLP:
    """Multi-layer perceptron with He init, ReLU, dropout, and softmax-CE."""
    def __init__(self, sizes, dropout=0.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W, self.b = [], []
        for nin, nout in zip(sizes[:-1], sizes[1:]):
            self.W.append(rng.normal(0, np.sqrt(2/nin), (nin, nout)))  # He init
            self.b.append(np.zeros(nout))
        self.dropout = dropout

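    def forward(self, x, training=True):
        self.cache, self.masks = [x], []   # layer inputs and dropout masks for backward
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            z = x @ W + b
            if i < len(self.W) - 1:  # hidden layers
                x = np.maximum(0, z)        # ReLU
                mask = np.ones_like(x)
                if training and self.dropout > 0:
                    mask = (np.random.rand(*x.shape) > self.dropout) / (1 - self.dropout)
                    x = x * mask
                self.masks.append(mask)
            else:  # output layer: stable softmax
                z -= z.max(axis=1, keepdims=True)
                x = np.exp(z); x /= x.sum(axis=1, keepdims=True)
            self.cache.append(x)
        return x

    def loss_and_grad(self, probs, y):  # y: integer labels
        N = len(y)
        loss = -np.log(probs[np.arange(N), y] + 1e-12).mean()
        grad = probs.copy(); grad[np.arange(N), y] -= 1; grad /= N  # softmax-CE grad
        return loss, grad

    def backward(self, grad, lr):
        """Minimal backward pass + SGD step (a sketch; the Code Lab in 10.9 calls this).
        `grad` is the softmax-CE gradient w.r.t. the logits from loss_and_grad."""
        self.last_grads = []
        for i in reversed(range(len(self.W))):
            dW = self.cache[i].T @ grad; db = grad.sum(axis=0)
            if i > 0:  # push the gradient back through dropout and ReLU of layer i-1
                grad = (grad @ self.W[i].T) * self.masks[i-1] * (self.cache[i] > 0)
            self.W[i] -= lr * dW; self.b[i] -= lr * db
            self.last_grads.append(dW)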

# This MLP -- linear layers + ReLU + dropout + softmax-CE -- contains
# every concept needed to understand a Transformer's feed-forward network.

Shape Trace Through the MLP

Tracking tensor shapes is the single most important debugging skill. Here is the shape of data flowing through a 784→256→10 MLP (MNIST classifier) with a batch of 32:

Shape Trace: MLP forward pass (batch=32)

Operation | Shape | Note
input x | (32, 784) | flattened 28×28 image
x @ W1 + b1 | (32, 256) | linear: 784→256
ReLU | (32, 256) | elementwise, shape unchanged
dropout | (32, 256) | mask + scale, shape unchanged
x @ W2 + b2 | (32, 10) | linear: 256→10 (logits)
softmax | (32, 10) | row-wise probabilities
cross-entropy | (scalar) | mean over batch
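The linear-layer rows of the trace can be generated mechanically from the layer sizes. The helper below is a hypothetical sketch (the name trace_mlp is not part of the chapter's MLP class) that also counts parameters, which is a useful second sanity check.

```python
def trace_mlp(sizes, batch):
    """Return the shape after each linear layer and the total parameter count."""
    shapes = [("input", (batch, sizes[0]))]
    params = 0
    for i, (nin, nout) in enumerate(zip(sizes[:-1], sizes[1:]), start=1):
        shapes.append((f"x @ W{i} + b{i}", (batch, nout)))
        params += nin * nout + nout          # weight matrix + bias vector
    return shapes, params

shapes, params = trace_mlp([784, 256, 10], batch=32)
for name, shape in shapes:
    print(f"{name:14s} {shape}")
print(f"total parameters: {params:,}")       # 203,530
```

ReLU, dropout, and softmax are elementwise or row-wise, so they never change these shapes; only the linear layers do.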
10.9

Training a neural network is an empirical science. Loss curves, gradient norms, and activation statistics tell you what is happening inside the network. Learning to read these signals is what separates practitioners who can debug from those who can only guess.

Reading Loss Curves

Symptom | Likely cause | Fix
Loss is NaN immediately | Exploding activations / log(0) | Lower LR, gradient clip, check init
Loss flat, never decreases | LR too low, or dead ReLUs | Raise LR, check init, use LeakyReLU
Loss decreases then explodes | LR too high | Lower LR, add warmup, clip gradients
Train loss ↓, val loss ↑ | Overfitting | More data, dropout, weight decay, early stop
Both losses plateau high | Underfitting / too small | More capacity, train longer, better features
Loss oscillates wildly | Batch too small / LR too high | Larger batch, lower LR, more momentum
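Gradient clipping appears twice in the table's fixes. One common variant is clipping by global norm, sketched below on plain Python lists; the function name and the max_norm value are illustrative assumptions, and other clipping schemes (for example per-value clipping) exist.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [[g * scale for g in grad] for grad in grads], total

grads = [[3.0, 4.0], [12.0]]                  # hypothetical gradients; global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(f"norm before: {norm_before:.2f}")      # 13.00
print(f"norm after:  {norm_after:.2f}")       # 1.00
```

Because every gradient is scaled by the same factor, the update direction is unchanged; only its length is capped. Logging norm_before each step also gives the gradient-norm diagnostic used in the Code Lab.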
Python: Code Lab: training an MLP on MNIST with diagnostics
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
X = X / 255.0; y = y.astype(int)  # normalize to [0,1]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

net = MLP([784, 256, 128, 10], dropout=0.2)
lr, batch = 0.1, 128

for epoch in range(10):
    perm = np.random.permutation(len(Xtr))
    for i in range(0, len(Xtr), batch):
        idx = perm[i:i+batch]
        probs = net.forward(Xtr[idx], training=True)
        loss, grad = net.loss_and_grad(probs, ytr[idx])
        net.backward(grad, lr)            # (backward + SGD update)
    # Diagnostics each epoch
    test_acc = (net.forward(Xte, training=False).argmax(1) == yte).mean()
    grad_norm = np.sqrt(sum((g**2).sum() for g in net.last_grads))
    print(f"epoch {epoch}: loss={loss:.3f} test_acc={test_acc:.3f} |grad|={grad_norm:.2f}")
# Illustrative output (exact numbers vary with the random seed and shuffle):
# epoch 0: loss=0.412 test_acc=0.928 |grad|=1.84
# epoch 9: loss=0.087 test_acc=0.978 |grad|=0.43  -- healthy convergence
Train Note: The Overfit-One-Batch Test
Before launching a long training run, verify your model can overfit a single batch of ~10 examples to near-zero loss. If it cannot, you have a bug — in the model, the loss, the gradients, or the data pipeline.
This 60-second test catches the majority of implementation bugs before they waste hours of compute. It is the first thing experienced practitioners do with any new model.
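The shape of the test can be sketched on the smallest possible model. The example below uses a linear softmax classifier in plain Python rather than the chapter's MLP; the random data, dimensions, learning rate, and step count are all arbitrary assumptions chosen so that ten examples are easy to memorize.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def overfit_one_batch(X, y, n_classes, steps=500, lr=0.2):
    """Train a linear softmax classifier on one fixed batch; return the loss history."""
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_classes)]
    losses = []
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n_classes)]
        loss = 0.0
        for x, t in zip(X, y):
            p = softmax([sum(w * a for w, a in zip(row, x)) for row in W])
            loss -= math.log(p[t] + 1e-12)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == t else 0.0)     # prediction - target
                for j in range(dim):
                    grads[k][j] += err * x[j] / len(X)
        for k in range(n_classes):
            for j in range(dim):
                W[k][j] -= lr * grads[k][j]
        losses.append(loss / len(X))
    return losses

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(10)]   # 10 random examples
y = [rng.randrange(3) for _ in range(10)]                        # random labels
losses = overfit_one_batch(X, y, n_classes=3)

print(f"initial loss: {losses[0]:.3f}")    # ln 3, since zero weights predict uniformly
print(f"final loss:   {losses[-1]:.3f}")   # should be far below the initial loss
```

If the final loss stays near the initial value, something in the forward pass, the gradient, or the update is wrong. Flipping the sign of the update (W += instead of W -=) is a quick way to see the test fail.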
10.10

Everything in this chapter is a literal component of the Transformer you will build in Chapter 13. The Transformer is not a fundamentally new kind of network — it is a particular arrangement of the same primitives: linear layers, activations, normalization, dropout, and residual connections.

This chapter | Role in the Transformer block
Linear layer (W x + b) | Q/K/V projections; FFN layers; output projection
Activation (GELU/SwiGLU) | The nonlinearity inside the feed-forward network
LayerNorm / RMSNorm | Applied before attention and FFN (Pre-LN)
Dropout | After attention softmax and after each sublayer
He / scaled init | Initializes all projection matrices
Residual connection | x + sublayer(x): the gradient highway from Ch. 9
Softmax + cross-entropy | Softmax produces the attention weights; softmax + cross-entropy is the final LM loss

The Transformer Block Preview

Here is the Pre-LN Transformer block you will build in Chapter 13, drawn as a stack. Notice that every box is a component from this chapter — only 'Multi-Head Attention' (Chapter 12) is new:

Arch Stack: Pre-LN Transformer Block

+ residual add | x + FFN_out
Dropout | p = 0.1
Feed-Forward (SwiGLU) | d → (8/3)d → d (parameter-matched to a standard d → 4d → d FFN)
LayerNorm / RMSNorm | normalize
+ residual add | x + Attn_out
Dropout | p = 0.1
Multi-Head Attention | (Chapter 12)
LayerNorm / RMSNorm | normalize
input x | (B, T, d)
You Already Understand 80% of a Transformer
Of the eight boxes in the Transformer block above, seven are components you have now built from scratch: linear layers, GELU/SwiGLU, LayerNorm, dropout, residual connections, and the softmax. Only multi-head attention remains.
Chapter 11 sharpens your backpropagation skills to handle these deep stacks; Chapter 12 builds the one missing piece, attention; and Chapter 13 assembles them all into the full Transformer. The hard conceptual work is largely behind you.
10.11

Component Quick-Reference

Component | Purpose | Default choice (2024)
Hidden activation | Nonlinearity | GELU or SwiGLU
Output activation | Map to task | Softmax (classification), linear (regression)
Initialization | Stable signal propagation | He (ReLU), scaled-by-depth (Transformer)
Normalization | Stabilize activations | RMSNorm (LLMs), LayerNorm (general)
Regularization | Prevent overfitting | Dropout 0.1, weight decay 0.1
Loss | Define objective | Cross-entropy (classification/LM)
Optimizer | Update weights | AdamW (from Chapter 2)

Exercises

Exercises 1–10 are pen-and-paper; 11–20 require code.

Exercise 1: Pen & Paper
Prove that a single perceptron cannot represent XOR. Show that no weights w₁, w₂, b satisfy all four XOR constraints simultaneously.
Exercise 2: Pen & Paper
Show that an MLP with linear (identity) activations is equivalent to a single linear layer, regardless of depth. Why does this make nonlinearities essential?
Exercise 3: Pen & Paper
Construct explicit weights for a 2→2→1 MLP with ReLU that computes XOR exactly. (Hint: one hidden unit computes OR, the other AND.)
Exercise 4: Pen & Paper
Derive the gradient of GELU(x) = x·Φ(x) using the product rule. Express Φ'(x) in terms of the standard normal density.
Exercise 5: Derive
Derive the He initialization variance Var(W) = 2/n_in for a ReLU layer. Account for the fact that ReLU zeroes half the inputs, halving the variance.
Exercise 6: Pen & Paper
Show that LayerNorm is invariant to positive scaling and shifting of its input: LN(ax + c) = LN(x) for scalars a > 0 and c (before the learned γ, β, and ignoring ε). Why is this property useful?
Exercise 7: Pen & Paper
Inverted dropout scales survivors by 1/(1-p). Prove this preserves the expected activation E[y] = E[x], so no scaling is needed at inference.
Exercise 8: Pen & Paper
Compare the parameter count of a standard ReLU FFN (d→4d→d) with a SwiGLU FFN. What hidden dimension makes SwiGLU parameter-matched to the standard FFN?
Exercise 9: Pen & Paper
Explain why BatchNorm fails for a Transformer processing variable-length sequences with batch size 1, while LayerNorm works fine.
Exercise 10: Pen & Paper
A 50-layer network with sigmoid activations (max gradient 0.25) is trained by backprop. Estimate the gradient magnitude reaching layer 1 relative to the output. Repeat for ReLU.
Exercise 11: Code
Implement a perceptron and confirm it learns AND, OR, NAND but fails on XOR. Plot the decision boundary it converges to for each.
Exercise 12: Code
Implement an MLP that solves XOR from scratch. Visualize the hidden-layer representation of the 4 input points — show they become linearly separable.
Exercise 13: Code
Plot all 7 activation functions and their derivatives on [-5, 5]. Annotate the maximum gradient of each and the regions where each saturates.
Exercise 14: Code Lab
Reproduce the signal-propagation experiment: pass a signal through 50 ReLU layers with tiny, huge, and He initialization. Plot activation std vs. depth for each. Confirm He preserves the signal.
Exercise 15: Code
Implement BatchNorm, LayerNorm, and RMSNorm from scratch. Apply each to a (4, 768) tensor with large mean and verify the normalization properties of each.
Exercise 16: Code
Implement inverted dropout. Empirically verify that the expected activation is preserved across 10,000 trials at p = 0.1, 0.3, 0.5.
Exercise 17: Code Lab
Build and train the complete MLP class on MNIST. Achieve >97% test accuracy. Plot train and validation loss curves and identify the onset of overfitting.
Exercise 18: Code
Ablation study: train the MNIST MLP with and without (a) He init, (b) dropout, (c) LayerNorm. Report the effect of each on final test accuracy and training stability.
Exercise 19: Code
Implement the overfit-one-batch test: confirm your MLP can drive the loss on a 10-example batch to near zero. Then deliberately introduce a bug (e.g., wrong gradient sign) and show the test catches it.
Exercise 20: Code (Challenge)
Build the Pre-LN block skeleton from Section 10.10 with a placeholder attention function (identity). Verify shapes flow correctly through LayerNorm, dropout, FFN, and residual adds for a (8, 64, 256) input. In Chapter 12 you will drop in real attention.

Further reading: “Deep Learning” (Goodfellow, Bengio, Courville, 2016) Chapters 6–8 — the canonical reference for MLPs, regularization, and optimization. “Delving Deep into Rectifiers” (He et al., 2015) for He initialization. “Layer Normalization” (Ba et al., 2016) and “Root Mean Square Layer Normalization” (Zhang & Sennrich, 2019). “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” (Srivastava et al., 2014). Michael Nielsen's free online book “Neural Networks and Deep Learning” for unmatched intuition.


Next → Chapter 11: Backpropagation in Depth

You have built MLPs and updated their weights, but the backward passes so far have been hand-derived for each specific network. Chapter 11 generalizes this: we build a complete automatic differentiation engine that computes gradients for any computational graph. You will understand exactly how PyTorch and JAX work under the hood, build a working autograd system, and gain the skills to debug gradients through the deep stacks of a Transformer.
