Part I: Mathematical Foundations

Chapter 3

Probability & Statistics

Distributions, Bayes, maximum likelihood, and information

20 Exercises

Learning Objectives

1.	Understand discrete and continuous probability distributions and sample from them in NumPy.
2.	Apply Bayes’ theorem to update beliefs with evidence, and connect it to MAP estimation.
3.	Derive maximum likelihood estimation (MLE) and understand why cross-entropy loss is MLE.
4.	Compute entropy, KL divergence, and cross-entropy for any pair of distributions.
5.	Explain and implement temperature scaling, top-k, and nucleus (top-p) sampling.
6.	Interpret perplexity as an evaluation metric and understand its limitations.
7.	Connect the entire probabilistic framework to the language modeling objective P(x₁...xₙ).

A language model assigns probabilities to sequences of tokens. To understand why that’s meaningful and how to train it, we need the vocabulary of probability theory: random variables, distributions, and the rules that govern them.

Random Variables

Random Variable

A variable X whose value is determined by a random experiment. We write P(X = x) for the probability that X takes value x.

Random variables come in two flavours that require different mathematical treatment:

Type	Description	ML examples
Discrete	Countable values. Probabilities sum to 1.	Token IDs, class labels, number of layers
Continuous	Uncountable values (real numbers). Probabilities integrate to 1.	Weight values, loss, activations, learning rate

Probability Mass & Density Functions

For discrete variables, we use a Probability Mass Function (PMF): P(X = x) directly gives the probability. For continuous variables, we use a Probability Density Function (PDF): probability is the area under the curve over an interval.

Python•PMF and PDF — discrete vs. continuous
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# ---- Discrete: Categorical distribution (token probabilities) ----
vocab      = ['the', 'cat', 'sat', 'on', 'mat']
probs      = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
print(probs.sum())                    # 1.0 — must sum to 1

# PMF: P(X = 'cat') is just probs[1]
print(f"P(cat) = {probs[1]:.2f}")

# Sample from categorical (simulates LM token sampling)
samples = np.random.choice(vocab, size=10, p=probs)
print(samples)  # e.g. ['the' 'cat' 'the' 'on' 'the' ...]

# ---- Continuous: Gaussian distribution (weight init) ----
mu, sigma = 0.0, 0.02            # Xavier-like init for small nets
weights   = np.random.normal(mu, sigma, size=(768, 768))
print(f"Weight mean: {weights.mean():.4f}")   # ≈0.0
print(f"Weight std:  {weights.std():.4f}")   # ≈0.02

# PDF value at a point (NOT a probability — it’s a density)
pdf_at_zero = stats.norm.pdf(0, mu, sigma)  # ≈19.9 (>1 is fine for PDF!)

The Essential Distributions for ML

Distribution	Type	Parameters	Where in ML
Bernoulli	Discrete	p	Binary classification output, single dropout mask value
Categorical	Discrete	p₁...pₙ	Token prediction: softmax output over vocabulary
Gaussian / Normal	Continuous	μ, σ²	Weight init, VAE latent space, diffusion models
Uniform	Both	a, b	Random seeds, positional perturbations, random search
Dirichlet	Continuous	α₁...αₙ	Prior over categorical distributions in Bayesian NLP
Multinomial	Discrete	n, p	Generating n tokens from a vocabulary distribution

Language modeling is fundamentally about the joint distribution over sequences. The chain rule of probability allows us to factorize this frightening high-dimensional object into a product of tractable conditional distributions — one per token.

Joint Distribution P(X, Y)

The joint distribution captures the probability of two (or more) variables taking specific values simultaneously. A language model’s true goal is to learn the joint distribution over all tokens in a sequence.

Python•Joint distribution — two discrete variables
import numpy as np

# Joint distribution P(Weather, Mood)
# Rows: Weather = {Sunny, Rainy}
# Cols: Mood    = {Happy, Sad}
joint = np.array([
    [0.40, 0.10],   # Sunny: 40% happy, 10% sad
    [0.15, 0.35],   # Rainy: 15% happy, 35% sad
])
print(joint.sum())                        # 1.0

# Marginal P(Weather): sum over Mood (axis=1)
p_weather = joint.sum(axis=1)               # [0.5, 0.5] Sunny/Rainy equally likely

# Marginal P(Mood): sum over Weather (axis=0)
p_mood    = joint.sum(axis=0)               # [0.55, 0.45] Happy/Sad

# Conditional P(Mood | Weather=Sunny): joint / marginal
p_mood_given_sunny = joint[0] / p_weather[0]  # [0.8, 0.2]
print(p_mood_given_sunny)                  # [0.8  0.2] — sunny → usually happy

The Chain Rule of Probability

Any joint distribution can be factorized into a product of conditionals. The order of factorization is arbitrary — left-to-right is just a convention:

P(x₁, x₂, ..., xₙ) = P(x₁) × P(x₂ | x₁) × P(x₃ | x₁,x₂) × ... × P(xₙ | x₁...x_{n-1})

✧

This IS Language Modeling

Every autoregressive language model — GPT, LLaMA, Mistral — is trained to estimate exactly these conditional probabilities: P(next token | all previous tokens).

At inference time, the model samples one token at a time, each conditioned on all previously generated tokens. The chain rule guarantees this produces valid samples from the full joint distribution over sequences.

Python•Autoregressive factorization in practice
import numpy as np

# Simulate a tiny 4-token language model
# P(sequence) = P(x1) * P(x2|x1) * P(x3|x1,x2) * ...

np.random.seed(42)
vocab = ['<s>', 'the', 'cat', 'sat', '</s>']

def mock_lm_probs(context):
    """Mock language model: returns next-token probs given context."""
    if context == ['<s>']:         return [0.0, 0.7, 0.2, 0.0, 0.1]  # after <s>
    if context[-1] == 'the':       return [0.0, 0.0, 0.6, 0.1, 0.3]  # after 'the'
    if context[-1] == 'cat':       return [0.0, 0.0, 0.0, 0.8, 0.2]  # after 'cat'
    return [0.0, 0.0, 0.0, 0.0, 1.0]               # default: end

# Generate a sequence and compute its probability
context     = ['<s>']
log_prob    = 0.0  # work in log space to avoid underflow
for step in range(3):
    probs   = mock_lm_probs(context)
    tok_idx = np.random.choice(len(vocab), p=probs)
    tok     = vocab[tok_idx]
    log_prob += np.log(probs[tok_idx] + 1e-10)  # log P(x_t | x_{<t})
    context.append(tok)
    print(f"Step {step+1}: generated '{tok}', log_prob so far = {log_prob:.3f}")
print(f"Sequence: {context}  P(seq) ≈ {np.exp(log_prob):.4f}")

Independence & Conditional Independence

Two variables X and Y are independent if knowing one tells you nothing about the other: P(X,Y) = P(X)P(Y). Conditional independence is the weaker (and more practically useful) form: X ⊥ Y | Z means X and Y are independent once we know Z.

▶

ML Connection: The Naive Bayes Assumption

The Naive Bayes classifier assumes all features are conditionally independent given the class label. This is almost never true, yet NB works surprisingly well — demonstrating that conditional independence is a useful modelling simplification even when violated.

Transformers implicitly model complex dependencies between all tokens through attention, replacing the independence assumption entirely.

Bayes’ theorem is arguably the most important equation in machine learning. It describes how to rationally update a prior belief in the light of new evidence. Every Bayesian model, MAP estimator, and probabilistic neural network is an application of this rule.

The Formula

P(θ | data) = P(data | θ) × P(θ) / P(data)

Term	Name	Meaning in ML
P(θ \| data)	Posterior	What we believe about θ after seeing the data
P(data \| θ)	Likelihood	How probable the data is under parameters θ
P(θ)	Prior	What we believed about θ before seeing any data
P(data)	Evidence / marginal	Normalizing constant (often intractable)

Python•Bayes’ theorem — medical test example
# Classic example: disease testing
# Shows why base rates (priors) matter enormously

p_disease     = 0.001   # P(disease): 1 in 1000 people have it
p_pos_given_d = 0.99    # P(positive | disease): 99% sensitivity
p_pos_given_nd= 0.05    # P(positive | no disease): 5% false positive

# P(positive) by total probability theorem
p_positive = (p_pos_given_d  * p_disease +
               p_pos_given_nd * (1 - p_disease))

# Bayes: P(disease | positive)
p_disease_given_pos = (p_pos_given_d * p_disease) / p_positive
print(f"P(disease | positive test) = {p_disease_given_pos:.1%}")
# ≈ 1.9%  — despite a 99% accurate test!

# Intuition: 1000 people → 1 has disease (1 likely positive)
#            999 don't  → ~50 false positives
# So among ~51 positives, only 1 is truly sick: 1/51 ≈ 2%

Maximum A Posteriori (MAP) Estimation

MLE finds the θ that maximises P(data|θ). MAP goes one step further and finds the θ that maximises the posterior P(θ|data) = P(data|θ) × P(θ). The prior acts as a regularizer.

MAP: θ* = argmax_θ [log P(data|θ) + log P(θ)]

MLE: θ* = argmax_θ log P(data|θ) (no prior)

Python•MAP = MLE + regularization
import numpy as np

# Training loss with L2 regularization = MAP with Gaussian prior
# L2 prior: P(θ) ∝ exp(-λ||θ||^2)  →  log P(θ) = -λ||θ||^2
# MAP objective: maximize log P(data|θ) - λ||θ||^2
# = minimize NLL + λ||θ||^2   ← standard L2-regularized training!

def map_loss(predictions, targets, weights, lam=0.01):
    """MAP loss = NLL + L2 regularization (Gaussian prior on weights)."""
    nll = -np.mean(targets * np.log(predictions + 1e-9)) # cross-entropy
    reg = lam * np.sum(weights**2)                       # L2 prior
    return nll + reg

# L1 regularization corresponds to a Laplace prior on weights:
# P(θ) ∝ exp(-λ||θ||_1)  →  promotes sparsity
def map_loss_l1(predictions, targets, weights, lam=0.01):
    nll = -np.mean(targets * np.log(predictions + 1e-9))
    reg = lam * np.sum(np.abs(weights))
    return nll + reg

▶

ML Connection: Weight Decay IS a Gaussian Prior

When you set weight_decay=0.1 in AdamW, you are implicitly placing a Gaussian prior on every weight: P(w) ∝ exp(-λ w²). MAP estimation then penalizes large weights, acting as a soft constraint that keeps the model from overfitting.

L1 regularization (weight decay on absolute values) corresponds to a Laplace prior and promotes sparsity — many weights become exactly zero. This is why L1 is used in feature selection but L2 is preferred for neural networks (sparse weights are hard to exploit on GPUs).

Rather than working with full distributions, we often summarize them with moments: the expected value (centre of mass), variance (spread), and covariance (linear dependence between two variables). These quantities appear throughout neural network design and analysis.

Expected Value

E[X] = Σ_x x × P(X=x) (discrete)

E[X] = ∫ x × p(x) dx (continuous)

Python•Expected value — theory and estimation
import numpy as np

# ---- Discrete: expected number of heads in 3 coin flips ----
# X ~ Binomial(n=3, p=0.5)
values = np.array([0, 1, 2, 3])
probs  = np.array([0.125, 0.375, 0.375, 0.125])
E_X = np.sum(values * probs)
print(f"E[X] = {E_X}")  # 1.5 = n*p = 3*0.5

# ---- Monte Carlo estimation: sample and average ----
# (This is how E[loss] is estimated during training!)
samples = np.random.binomial(n=3, p=0.5, size=100000)
print(f"MC estimate: {samples.mean():.4f}")  # ≈1.5

# ---- Continuous: E[X] for Gaussian μ=3, σ=2 ----
samples_g = np.random.normal(loc=3, scale=2, size=1000000)
print(f"E[N(3,4)]: {samples_g.mean():.4f}")  # ≈3.0

Variance & Standard Deviation

Var[X] = E[(X - E[X])²] = E[X²] - E[X]²

Python•Variance — numerically stable computation
import numpy as np

# Variance measures spread around the mean
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mean = x.mean()                      # 5.0
var  = np.mean((x - mean)**2)          # 4.0 (population variance)
std  = np.sqrt(var)                   # 2.0

# CAUTION: E[X^2] - E[X]^2 is numerically unstable for large values!
# np.var() and np.std() use the safer two-pass formula
print(np.var(x))        # 4.0 (matches)
print(np.var(x, ddof=1))  # 4.57 (sample variance, unbiased)

# Layer norm computes mean and variance per token
def layer_norm(x, eps=1e-5):
    mu  = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True, ddof=0)
    return (x - mu) / np.sqrt(var + eps)

acts    = np.random.randn(4, 768)      # (batch, dim)
normed  = layer_norm(acts)
print(normed.mean(axis=-1))  # [≈0, ≈0, ≈0, ≈0]
print(normed.std(axis=-1))   # [≈1, ≈1, ≈1, ≈1]

Covariance & Correlation

Cov[X,Y] = E[(X - E[X])(Y - E[Y])] Corr[X,Y] = Cov[X,Y] / (σ_X σ_Y) ∈ [-1,1]

Python•Covariance matrix — the foundation of PCA and attention
import numpy as np

np.random.seed(0)
# Simulate 3 features with known covariance structure
n    = 500
x1   = np.random.randn(n)
x2   = 0.8 * x1 + 0.6 * np.random.randn(n)   # strongly correlated with x1
x3   = np.random.randn(n)                    # independent
X    = np.stack([x1, x2, x3], axis=1)        # (500, 3)

cov = np.cov(X.T)  # (3,3) covariance matrix
print(np.round(cov, 2))
# [[ 1.00  0.81  0.01]   <- x1-x2 strongly correlated
#  [ 0.81  1.01 -0.02]
#  [ 0.01 -0.02  0.98]]  <- x3 near-zero correlation with x1, x2

# Correlation matrix: normalise to [-1, 1]
std  = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)
print(np.round(corr, 2))

The Gaussian (Normal) distribution is the most important continuous distribution in machine learning. It appears in weight initialization, data augmentation, latent spaces of generative models, and noise models. Understanding it deeply — not just its formula — is essential.

PDF and Parameters

N(x; μ, σ²) = [1/(σ√2π)] × exp(-(x-μ)² / (2σ²))

μ is the mean (peak of the bell), σ² is the variance (width of the bell), and σ is the standard deviation. The 68-95-99.7 rule tells us that 68% of probability mass lies within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

Python•Gaussian distribution — properties and the CLT
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0  # standard normal

# 68-95-99.7 rule: verified numerically
within_1s = stats.norm.cdf(mu+sigma) - stats.norm.cdf(mu-sigma)
within_2s = stats.norm.cdf(mu+2*sigma) - stats.norm.cdf(mu-2*sigma)
within_3s = stats.norm.cdf(mu+3*sigma) - stats.norm.cdf(mu-3*sigma)
print(f"±1σ: {within_1s:.1%}  ±2σ: {within_2s:.1%}  ±3σ: {within_3s:.1%}")
# ±1σ: 68.3%  ±2σ: 95.4%  ±3σ: 99.7%

# --- Central Limit Theorem: sums of ANY distribution converge to Gaussian ---
# Sum of 30 uniform[0,1] samples ≈ Gaussian
n_experiments = 10000
n_samples     = 30
sums = np.random.uniform(0, 1, size=(n_experiments, n_samples)).sum(axis=1)
_, p_value = stats.normaltest(sums)
print(f"Normality test p-value: {p_value:.3f}")  # >>0.05: looks Gaussian

# --- Xavier / He weight initialization ---
def xavier_init(fan_in, fan_out):
    """Keeps variance stable through layers of tanh activations."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.normal(0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """Keeps variance stable through layers of ReLU activations."""
    std = np.sqrt(2.0 / fan_in)
    return np.random.normal(0, std, size=(fan_in, fan_out))

W_xavier = xavier_init(512, 512) # std ≈ 0.0625
W_he     = he_init(512, 512)    # std ≈ 0.0625 * sqrt(2) ≈ 0.0884

✧

Why Gaussian Weight Initialization?

If we initialize weights from N(0, 1) and pass a signal through 100 layers, the variance either explodes (for large weights) or collapses to zero (for small weights). Xavier initialization carefully calibrates σ so that the variance of activations stays approximately constant across layers — enabling gradients to flow without vanishing or exploding even at initialization.

He initialization extends this for ReLU, which zeros out half the activations, effectively halving the variance. He compensates by scaling up by √2.

Multivariate Gaussian

For vectors in ℝⁿ, the Gaussian generalizes to the multivariate normal: N(x; μ, Σ) where Σ is the covariance matrix encoding the shape of the distribution. The covariance matrix determines whether the distribution is spherical, elongated, or tilted.

Python•Multivariate Gaussian — sampling and visualization
import numpy as np

# Spherical Gaussian: Σ = I (independent, equal variance)
mu_sphere     = [0.0, 0.0]
cov_sphere    = [[1.0, 0.0], [0.0, 1.0]]

# Elongated Gaussian: high variance in x1, low in x2
mu_elong      = [0.0, 0.0]
cov_elong     = [[4.0, 0.0], [0.0, 0.25]]

# Correlated Gaussian: positive off-diagonal
mu_corr       = [1.0, 2.0]
cov_corr      = [[1.0, 0.8], [0.8, 1.0]]

# Sample 500 points from each
samples_sphere = np.random.multivariate_normal(mu_sphere, cov_sphere, 500)
samples_elong  = np.random.multivariate_normal(mu_elong,  cov_elong,  500)
samples_corr   = np.random.multivariate_normal(mu_corr,   cov_corr,   500)

# In a VAE, we sample z ~ N(mu(x), diag(sigma^2(x)))
# where mu and sigma are output by the encoder network
def vae_reparameterize(mu, log_var):
    """Reparameterization trick: z = mu + eps*sigma, eps~N(0,I)."""
    std = np.exp(0.5 * log_var)
    eps = np.random.randn(*mu.shape)
    return mu + eps * std   # differentiable w.r.t. mu, log_var!

MLE is the principle behind training almost every neural network. The fundamental insight: choose model parameters that make the observed training data as probable as possible. It sounds obvious — but tracing this principle through to its mathematical conclusion reveals exactly why we use cross-entropy loss.

The Setup

We have a dataset D = {x₁, x₂, ..., xₙ} of n independent observations. Our model has parameters θ. The likelihood of observing this data under our model is:

L(θ) = P(D | θ) = P(x₁|θ) × P(x₂|θ) × ... × P(xₙ|θ)

MLE finds the θ that maximises L(θ). We work in log-space because: (1) products become sums (numerically stable), and (2) log is monotone so argmax is preserved.

log L(θ) = Σ_{i=1}^{n} log P(x_i | θ)

MLE Derivation: Categorical Distribution

For a vocabulary of V tokens, with counts cᵥ for each token v, the MLE solution is simply the empirical frequency. This is the simplest possible language model — a unigram.

Python•MLE for a categorical distribution — derived from scratch
import numpy as np
from collections import Counter

# Corpus of tokens
corpus = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat']
counts = Counter(corpus)   # {'the':3, 'cat':2, 'sat':1, 'on':1, 'mat':1}
n      = len(corpus)

# MLE solution: p_v = count_v / n
mle_probs = {word: cnt/n for word, cnt in counts.items()}
print(mle_probs)  # {'the': 0.375, 'cat': 0.25, ...}

# Log-likelihood of the corpus under MLE model (maximum possible)
log_L = sum(np.log(mle_probs[w]) for w in corpus)
print(f"Log-likelihood: {log_L:.3f}")

# Compare to a uniform model (all words equally likely)
p_uniform = 1.0 / len(counts)
log_L_uni = n * np.log(p_uniform)
print(f"Uniform log-lik: {log_L_uni:.3f}")  # always lower

MLE = Minimizing Cross-Entropy Loss

Here is the key derivation. Training a neural language model with cross-entropy loss is exactly MLE over token predictions. Let’s prove it:

Python•Proof: cross-entropy loss == negative log-likelihood
# For a classification problem with one-hot targets:
# - Target distribution: q (one-hot, e.g. [0,0,1,0,0])
# - Model distribution:  p (softmax output, e.g. [.1,.2,.5,.1,.1])

import numpy as np

def cross_entropy_loss(logits, target_class):
    """Cross-entropy loss for a single example."""
    # Numerically stable softmax
    logits  = logits - logits.max()    # subtract max for stability
    probs   = np.exp(logits) / np.sum(np.exp(logits))
    # Loss = -log P(correct class)  [one-hot collapses sum]
    return -np.log(probs[target_class] + 1e-9)

# Example: 5-class logits, true class = 2
logits = np.array([1.0, 2.0, 3.0, 0.5, 0.2])
loss   = cross_entropy_loss(logits, target_class=2)
print(f"Loss = {loss:.4f}")   # lower = more confident in class 2

# Connection: minimizing CE  ==  maximizing log P(data|theta)
# CE = H(q, p) = -sum_v q(v) log p(v)
# With one-hot q: CE = -log p(correct_class) = -log-likelihood!
# Training objective: minimize (1/N) sum_i CE_i  ==  maximize log-likelihood

✧

The Training Objective Is Always MLE

When you train a language model with cross-entropy loss, you are maximizing the log-likelihood of the training tokens. The model is asked: ‘given the previous tokens, how much probability mass do you assign to the next true token?’

Minimizing CE(q, p_θ) over θ is equivalent to maximizing Σ log P(x_t | x_{<t}; θ), which is exactly MLE for the autoregressive factorization.

Entropy measures how uncertain a distribution is. KL divergence measures how different two distributions are. Both are fundamental to loss functions, evaluation metrics, and the mathematics of RLHF.

Entropy

H(P) = -Σ_x P(x) log P(x) [bits if log₂, nats if ln]

Entropy is maximized by the uniform distribution (maximum uncertainty) and equals zero for a deterministic distribution (no uncertainty).

Python•Entropy — intuition and computation
import numpy as np
from scipy.stats import entropy

def shannon_entropy(probs, base=np.e):
    """Shannon entropy. base=np.e gives nats, base=2 gives bits."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # avoid log(0)
    return -np.sum(probs * np.log(probs) / np.log(base))

# Deterministic: one event with probability 1
print(shannon_entropy([1.0, 0.0, 0.0]))          # 0.0 nats (no uncertainty)

# Uniform over 2 outcomes (fair coin)
print(shannon_entropy([0.5, 0.5], base=2))  # 1.0 bit

# Uniform over 4 outcomes (2 coins)
print(shannon_entropy([.25]*4, base=2))  # 2.0 bits

# Realistic language model output (50257-token vocabulary)
logits_peaked = np.zeros(50257);  logits_peaked[42] = 10.0  # very confident
logits_flat   = np.zeros(50257)                            # maximally uncertain
def softmax(x): e = np.exp(x - x.max()); return e / e.sum()
print(f"Peaked H = {shannon_entropy(softmax(logits_peaked)):.2f} nats")  # ≈0.0
print(f"Flat   H = {shannon_entropy(softmax(logits_flat)):.2f} nats")  # ≈10.82

KL Divergence

D_KL(P ‖ Q) = Σ_x P(x) log [P(x) / Q(x)] = H(P, Q) - H(P)

KL divergence measures how many extra nats we spend on average when we use Q to encode messages from P. It is always ≥ 0 (Gibbs’ inequality) and equals zero iff P = Q. Crucially, it is NOT symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P).

Python•KL divergence — asymmetry and RLHF connection
import numpy as np

def kl_divergence(P, Q, eps=1e-10):
    """KL(P || Q): how different Q is from P, measured from P's perspective."""
    P, Q = np.array(P) + eps, np.array(Q) + eps
    P, Q = P / P.sum(), Q / Q.sum()  # normalize
    return np.sum(P * np.log(P / Q))

# KL is NOT symmetric
P = [0.9, 0.1]  # true distribution (confident)
Q = [0.5, 0.5]  # model distribution (uncertain)
print(f"KL(P||Q) = {kl_divergence(P, Q):.4f}")  # 0.5108
print(f"KL(Q||P) = {kl_divergence(Q, P):.4f}")  # 0.4185 (different!)

# KL >= 0: proved empirically
for _ in range(1000):
    p = np.random.dirichlet(np.ones(10))
    q = np.random.dirichlet(np.ones(10))
    assert kl_divergence(p, q) >= -1e-9  # always true

# --- RLHF KL penalty ---
# RLHF objective: maximize E[r(x)] - beta * KL(pi_theta || pi_ref)
# The KL term penalises the fine-tuned policy pi_theta for
# drifting too far from the reference policy pi_ref.
# Without it, the model collapses to degenerate reward-hacking outputs.

▶

ML Connection: KL in RLHF

During RLHF fine-tuning (e.g., training ChatGPT-style models), the training objective is: maximize expected reward minus β × KL(π_θ ‖ π_ref).

The KL penalty is critical: without it, the model quickly learns to produce outputs that get high reward scores from the reward model while completely departing from the distribution of coherent text. The β hyperparameter controls how conservatively the model is allowed to deviate from its pretrained behaviour.

Once a language model produces logits over the vocabulary, those logits must be converted to a probability distribution and a token sampled from it. This sampling step is the primary knob for controlling the “creativity” of generated text.

Greedy vs. Sampling

Greedy decoding always picks the most probable token. It is fast and deterministic but tends to produce repetitive, conservative text. Sampling introduces randomness that makes outputs more diverse and often more natural.

Python•Greedy decoding vs. random sampling
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # numerically stable
    return e / e.sum()

vocab  = ['the', 'a', 'cat', 'dog', 'ran', 'sat', 'leapt']
logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])
probs  = softmax(logits)

# Greedy: always pick argmax
greedy_token = vocab[np.argmax(probs)]
print(f"Greedy: '{greedy_token}' (always)")

# Sampling: draw from distribution
np.random.seed(42)
samples = [vocab[np.random.choice(len(vocab), p=probs)] for _ in range(10)]
print(f"Samples: {samples}")
# e.g. ['the', 'the', 'a', 'the', 'cat', 'the', 'a', 'the', 'the', 'a']

Temperature Scaling

Temperature T scales the logits before softmax. T < 1 makes the distribution sharper (more deterministic). T > 1 makes it flatter (more random). T = 1 is unmodified sampling.

p_i = softmax(logits / T)_i

Python•Temperature scaling — full implementation and analysis
import numpy as np

def temperature_sample(logits, temperature=1.0):
    """Sample from logits with temperature scaling."""
    assert temperature > 0, "Temperature must be positive"
    scaled_logits = logits / temperature
    probs         = softmax(scaled_logits)
    return np.random.choice(len(probs), p=probs), probs

vocab  = ['the', 'a', 'cat', 'dog', 'ran', 'sat', 'leapt']
logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])

print(f"{'Temp':>6}  {'Distribution (top 3)':>40}")
for T in [0.1, 0.5, 1.0, 1.5, 2.0]:
    _, probs = temperature_sample(logits, T)
    top3    = sorted(zip(vocab, probs), key=lambda x: -x[1])[:3]
    top3_str = '  '.join(f"{w}:{p:.2f}" for w, p in top3)
    print(f"{T:>6.1f}  {top3_str}")
# T=0.1:  the:0.99   a:0.00   cat:0.00  (near-greedy)
# T=1.0:  the:0.44   a:0.21   cat:0.14  (default sampling)
# T=2.0:  the:0.26   a:0.21   cat:0.18  (almost uniform)

Top-k Sampling

Top-k sampling restricts sampling to the k most probable tokens, setting all others to zero probability. This prevents very unlikely tokens from ever being sampled, reducing “tail risk” in generation.

Python•Top-k sampling — implementation
import numpy as np

def top_k_sample(logits, k, temperature=1.0):
    """Zero out all but top-k logits, then sample with temperature."""
    assert 1 <= k <= len(logits)
    # Find k-th largest logit value
    threshold      = np.partition(logits, -k)[-k]
    # Mask out all logits below threshold
    masked_logits  = np.where(logits >= threshold, logits, -np.inf)
    scaled_logits  = masked_logits / temperature
    probs          = softmax(scaled_logits)
    return np.random.choice(len(probs), p=probs)

logits = np.array([3.0, 2.0, 1.5, 1.2, 0.8, 0.4, 0.1])

# With k=3, only tokens 0,1,2 are candidates
np.random.seed(42)
samples_k3 = [vocab[top_k_sample(logits, k=3)] for _ in range(20)]
from collections import Counter
print(Counter(samples_k3))  # only 'the', 'a', 'cat' appear

Nucleus (Top-p) Sampling

Top-p sampling (also called nucleus sampling) selects the smallest set of tokens whose cumulative probability exceeds p, then samples from that set. This dynamically adjusts the candidate set size — narrower when the model is confident, wider when uncertain.

Python•Top-p (nucleus) sampling — full implementation
import numpy as np

def nucleus_sample(logits, p, temperature=1.0):
    """
    Nucleus (top-p) sampling.
    Selects the smallest set of tokens whose cumulative prob >= p.
    """
    scaled_logits = logits / temperature
    probs         = softmax(scaled_logits)

    # Sort by probability descending
    sorted_idx   = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_idx]
    cumulative   = np.cumsum(sorted_probs)

    # Find cutoff: first index where cumulative >= p
    cutoff_idx = np.searchsorted(cumulative, p) + 1
    nucleus_idx = sorted_idx[:cutoff_idx]

    # Renormalise over nucleus and sample
    nucleus_probs = probs[nucleus_idx]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    chosen        = np.random.choice(nucleus_idx, p=nucleus_probs)
    return chosen

# Comparison: peaked distribution → small nucleus
peaked_logits = np.array([10.0, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0])
flat_logits   = np.array([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8])

# Count nucleus size over many samples
def nucleus_size(logits, p):
    probs = softmax(logits)
    s_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(s_probs), p)) + 1
print(f"Peaked: nucleus size = {nucleus_size(peaked_logits, 0.9)} (p=0.9)") # 1
print(f"Flat:   nucleus size = {nucleus_size(flat_logits,   0.9)} (p=0.9)") # 6

✎

Code Lab 3.1 — Sampling Strategy Comparison

1. Load any pretrained GPT-2 with transformers. Generate 20 completions of the prompt 'The scientist discovered that' using: greedy, T=0.7, T=1.4, top-k=40, top-p=0.9.

2. Measure the token-level entropy of each generation method’s distributions.

3. Compute the distinct-1 and distinct-2 metrics (fraction of unique unigrams/bigrams) for each method across the 20 completions.

4. Plot a bar chart of diversity vs. entropy. Which setting gives the best creativity/coherence trade-off?

Perplexity is the exponential of the average cross-entropy per token. It measures how surprised, on average, the model is by each token in the evaluation set. A perplexity of k means the model is as confused as if it were choosing uniformly from k equally likely options at every step.

Definition

PP(P, Q) = exp(H(P, Q)) = exp(-Σ P(x) log Q(x))

For a language model: PP = exp(-(1/N) Σ_{t=1}^{N} log P(x_t | x_{<t}))

Python•Perplexity — from scratch and with transformers
import numpy as np

def perplexity_from_log_probs(log_probs):
    """
    Compute perplexity from a list of log P(x_t | x_{<t}).
    log_probs: list of log-probabilities for each token.
    """
    avg_nll = -np.mean(log_probs)  # average negative log-likelihood
    return np.exp(avg_nll)

# Perfect model: always predicts correct token with prob 1.0
perfect_log_probs = [np.log(1.0)] * 100
print(perplexity_from_log_probs(perfect_log_probs))  # 1.0

# Uniform model: 50257-token vocabulary
uniform_log_probs = [np.log(1/50257)] * 100
print(perplexity_from_log_probs(uniform_log_probs))  # 50257.0

# GPT-2 large on WikiText-103: PP ≈18
# GPT-3 175B on PTB: PP ≈20.5
# Random: PP = vocab_size ≈50257 for GPT tokenizer

# --- Using HuggingFace to compute real perplexity ---
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

def compute_perplexity(text, model, tokenizer, stride=512):
    """Sliding-window perplexity (handles texts longer than max_length)."""
    encodings = tokenizer(text, return_tensors='pt')
    input_ids = encodings.input_ids
    nlls, prev_end = [], 0
    for begin in range(0, input_ids.size(1), stride):
        end          = min(begin + 1024, input_ids.size(1))
        target_len   = end - prev_end
        chunk        = input_ids[:, begin:end]
        target_ids   = chunk.clone()
        target_ids[:, :-target_len] = -100  # ignore context tokens
        with torch.no_grad():
            loss = model(chunk, labels=target_ids).loss
        nlls.append(loss * target_len)
        prev_end = end
    return torch.exp(torch.stack(nlls).sum() / prev_end).item()

Perplexity Across Models — A Reference Table

Model	Dataset	Perplexity	Note
Random (uniform)	Any	≈50,257	GPT-2 tokenizer vocab
Trigram LM	PTB	≈74	Classical n-gram baseline
GPT-2 small (117M)	WikiText-103	≈37.5	Character-level proxy
GPT-2 large (774M)	WikiText-103	≈18.3	Open-source 2019
GPT-3 (175B)	Penn TreeBank	≈20.5	Brown et al. 2020
LLaMA-2 13B	WikiText-2	≈7.5	Meta, 2023
GPT-4 (estimated)	Various	< 5	Estimated, unreported

⚠️

Perplexity Limitations

Perplexity depends on tokenization: models with smaller vocabularies will have higher PP even if equally capable, because each “event” carries more information.

Perplexity cannot be compared across models using different tokenizers. Always report dataset, tokenizer, and stride length when publishing PP numbers.

When does a model memorize training data versus generalize to new examples? The bias–variance framework provides the answer, giving a decomposition of prediction error into three components with very different remedies.

The Decomposition

E[(y - f̂(x))²] = Bias[ˆf]² + Var[ˆf] + σ²_noise

Term	Meaning	Remedy
Bias²	Systematic error: how far the model’s average prediction is from the truth. High in underfitting models.	Increase model capacity, train longer, use better features
Variance	Sensitivity to training data: how much predictions change with different datasets. High in overfitting models.	More data, regularization, dropout, early stopping
Noise σ²	Irreducible error: inherent randomness in the labels. No model can predict this.	Data cleaning, better labelling, no remedy for true noise

Python•Bias-variance decomposition — empirical demonstration
import numpy as np

np.random.seed(0)
true_fn = lambda x: np.sin(2 * np.pi * x)

def fit_poly(X_train, y_train, degree, x_test):
    """Fit a polynomial of given degree, evaluate at x_test."""
    coeffs = np.polyfit(X_train, y_train, degree)
    return np.polyval(coeffs, x_test)

n_datasets, n_train = 200, 20
x_test = 0.5
sigma  = 0.3              # noise level

print(f"{'Degree':>8} {'Bias^2':>10} {'Variance':>10} {'Total MSE':>12}")
for degree in [1, 3, 7, 15]:
    preds = []
    for _ in range(n_datasets):
        X_tr = np.random.uniform(0, 1, n_train)
        y_tr = true_fn(X_tr) + sigma * np.random.randn(n_train)
        preds.append(fit_poly(X_tr, y_tr, degree, x_test))
    preds  = np.array(preds)
    y_true = true_fn(x_test)
    bias2  = (preds.mean() - y_true) ** 2
    var    = preds.var()
    mse    = bias2 + var + sigma**2
    print(f"{degree:>8} {bias2:>10.4f} {var:>10.4f} {mse:>12.4f}")
# degree=1:  high bias,   low  variance  (underfitting)
# degree=7:  low  bias,   low  variance  (sweet spot)
# degree=15: low  bias,   HIGH variance  (overfitting)

▶

ML Connection: LLMs Are in a New Regime

The bias-variance trade-off predicts that large models (low bias) should overfit badly. Modern LLMs appear to violate this — they generalize well despite being massively overparameterized. This is the ‘double descent’ phenomenon: beyond the interpolation threshold, more parameters can actually reduce test error.

Current understanding: LLMs are regularized implicitly by the optimizer (SGD-like methods prefer ‘flatter’ minima), by the training data diversity, and by early stopping. The classical bias-variance trade-off remains useful for smaller-scale ML tasks.

Concept Map

Concept	Key formula / takeaway
PMF / PDF	P(X=x) sums to 1 (discrete) \| p(x) integrates to 1 (continuous)
Chain rule of probability	P(x₁...xₙ) = Π P(x_t \| x_{<t}) [autoregressive LM]
Bayes’ theorem	P(θ\|D) ∝ P(D\|θ) P(θ) [posterior ∝ likelihood × prior]
MLE	argmax_θ Σ log P(x_i\|θ) = argmin CE loss
MAP	MLE + log prior = MLE + regularization
Entropy	H(P) = -Σ P(x) log P(x) [bits of uncertainty]
KL divergence	D_KL(P‖Q) ≥ 0, = 0 iff P = Q; not symmetric
Cross-entropy	H(P,Q) = H(P) + D_KL(P‖Q) ≥ H(P)
Perplexity	PP = exp(H(P,Q)); lower is better; tokenizer-dependent
Temperature	p_i = softmax(logits/T); T<1 sharper, T>1 flatter
Top-p sampling	Sample from smallest set with cumulative prob ≥ p

Exercises

Exercises 1–12 are pen-and-paper; 13–20 require code. The coding exercises build a miniature language modelling pipeline from scratch.

✎

Exercise 1: Pen & Paper

A bag has 3 red and 7 blue balls. You draw 2 without replacement. What is P(both red)?

✎

Exercise 2: Pen & Paper

A spam filter has sensitivity 95% and specificity 99%. Spam accounts for 2% of email. Given a positive detection, what is P(spam)?

✎

Exercise 3: Pen & Paper

Prove that the MLE for a Bernoulli distribution is p̂ = k/n, where k is the number of successes in n trials.

✎

Exercise 4: Pen & Paper

Show that L2 regularization on the weights is equivalent to MAP estimation with a Gaussian prior N(0, 1/(2λ)I) on θ.

✎

Exercise 5: Pen & Paper

Compute H([0.5, 0.5]) and H([0.9, 0.1]) in bits. Which has higher entropy? Why?

✎

Exercise 6: Pen & Paper

Compute KL(P‖Q) where P=[0.4, 0.6] and Q=[0.5, 0.5]. Verify KL ≥ 0.

✎

Exercise 7: Pen & Paper

Cross-entropy H(P,Q) = H(P) + KL(P‖Q). Verify this algebraically.

✎

Exercise 8: Pen & Paper

A model assigns probability 0.01 to the true next token. What is the contribution to perplexity from that token?

✎

Exercise 9: Pen & Paper

Why does sampling with T → 0 converge to greedy decoding? Show the limiting behavior of softmax(logits/T) as T → 0.

✎

Exercise 10: Pen & Paper

Explain why D_KL(P‖Q) ≠ D_KL(Q‖P). Give an ML scenario where the asymmetry matters.

✎

Exercise 11: Pen & Paper

The variance of a Bernoulli(p) random variable is p(1-p). At what value of p is variance maximized? Minimize it?

✎

Exercise 12: Pen & Paper

Derive the gradient of cross-entropy loss w.r.t. logits before softmax. Show it equals p_i - y_i where p is the softmax output and y is the one-hot target.

✎

Exercise 13: Code

Build a unigram language model on a text file. Compute its perplexity on a held-out set. Compare to a random model.

✎

Exercise 14: Code

Implement a bigram language model using conditional counts. Generate 50 tokens by sampling autoregressively.

✎

Exercise 15: Code Lab 3.1

Compare greedy, T=0.7, top-k=40, top-p=0.9 sampling on GPT-2. Measure distinct-1, distinct-2 diversity metrics.

✎

Exercise 16: Code

Plot the KL divergence D_KL(P‖Q) as a function of Q for a fixed P=[0.6, 0.4]. What happens as Q approaches [1,0] or [0,1]?

✎

Exercise 17: Code

Implement and verify the bias-variance decomposition empirically for polynomial regression of degree 1, 5, 10, 20.

✎

Exercise 18: Code

Implement temperature, top-k, and top-p sampling from scratch on a 1000-token vocabulary with random logits. Test edge cases: T=0.001, k=1, p=0.01.

✎

Exercise 19: Code

Compute perplexity of GPT-2 small on: (a) a paragraph of Wikipedia, (b) random English words, (c) random characters. Explain the differences.

✎

Exercise 20: Code (Challenge)

Implement the full MLE training loop for a trigram language model with add-k smoothing. Tune k to minimize perplexity on a validation set.

Further reading: “Pattern Recognition and Machine Learning” (Bishop, 2006) — Chapters 1–3 for a comprehensive Bayesian treatment. “The Elements of Statistical Learning” (Hastie et al.) for bias-variance and classical estimators. 3Blue1Brown’s Bayes’ theorem video for geometric intuition.

Next: Chapter 4 — Information Theory. We complete the probabilistic picture by formalizing entropy, mutual information, and the connection between compression and prediction.

✎ 20 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 2. Calculus & Optimization

NextCh 4. Information Theory

→

Probability & Statistics

Learning Objectives

Random Variables & Probability Distributions