Information Theory
Every time a language model predicts the next token, it is implicitly making a bet on how surprising that token will be. The model that wins is the one that assigns the least surprise — in expectation — to the true tokens. Information theory makes this intuition precise.
Claude Shannon founded information theory in 1948 with a deceptively simple question: “How much information is in a message?” His answer — it depends on how surprising it is — turns out to be the correct foundation for understanding every loss function in deep learning.
Three Equivalent Framings
Before defining entropy (which averages over an entire distribution), we define surprisal: the information content of a single event. The rarer the event, the more surprising it is, and the more information it carries.
Definition
Three properties that uniquely characterise a reasonable measure of surprise:
The only function satisfying all three is the negative logarithm.
import numpy as np
def surprisal(p, base=2):
"""Surprisal (self-information) of an event with probability p."""
assert 0 < p <= 1, f"Probability must be in (0,1], got {p}"
return -np.log(p) / np.log(base) # change of base
# Coin flip: P(heads) = 0.5 → 1 bit of surprise
print(f"Fair coin flip: {surprisal(0.5):.2f} bits")
print(f"1-in-4 outcome: {surprisal(0.25):.2f} bits")
print(f"Certain event: {surprisal(1.0):.2f} bits")
print(f"1-in-million: {surprisal(1e-6):.2f} bits")
# Token-level surprisal in a language model
# When the model assigns P('the') = 0.40, the surprisal is:
p_the = 0.40
print(f"'the' surprisal: {surprisal(p_the):.3f} bits")
# 1.322 bits -- less than a fair coin because 'the' is common
# Rare technical term: model assigns P('eigenvector') = 0.0001
p_eig = 0.0001
print(f"'eigenvector' surprisal: {surprisal(p_eig):.3f} bits")
# 13.288 bits -- very surprising to a general model
# Independent events: surprisals ADD (probabilities MULTIPLY)
p_seq = 0.5 * 0.25 # P('heads' then '1-in-4') -- independent
direct = surprisal(p_seq)
summed = surprisal(0.5) + surprisal(0.25)
print(f"Direct: {direct:.2f} Summed: {summed:.2f}")
# Direct: 3.00 Summed: 3.00 ✔Surprisal and the NLL Loss
Every cross-entropy training step computes the surprisal of the correct token under the model and tries to reduce it. The negative log-likelihood (NLL) loss is exactly the average surprisal across all training tokens:
import numpy as np
import torch
import torch.nn.functional as F
# Simulate model logits for 5 tokens, vocab size 10
torch.manual_seed(42)
logits = torch.randn(5, 10) # (T=5, V=10)
targets = torch.tensor([3, 7, 1, 4, 9]) # true next tokens
# Standard cross-entropy loss
ce_loss = F.cross_entropy(logits, targets)
# Manually: average surprisal of true tokens
probs = F.softmax(logits, dim=-1)
true_probs = probs[torch.arange(5), targets]
surprisals = -torch.log(true_probs)
manual_loss = surprisals.mean()
print(f"CE loss: {ce_loss.item():.4f}")
print(f"Manual (avg surprisal): {manual_loss.item():.4f}")
print(f"Token surprisals: {surprisals.tolist()}")
print(torch.allclose(ce_loss, manual_loss))
# True -- they are identical!Entropy is the expected surprisal. Where surprisal tells you how surprising a single outcome was, entropy tells you how surprising events from a distribution are, on average. It is the single most important quantity in information theory.
Definition and Derivation
This is not an arbitrary formula. Shannon showed it is the unique function satisfying three natural axioms: continuity in probabilities, maximum for the uniform distribution, and additivity for independent sub-experiments. The log is the only function that delivers all three.
import numpy as np
def entropy(probs, base=np.e):
"""Shannon entropy. base=np.e → nats; base=2 → bits."""
probs = np.asarray(probs, dtype=float)
probs = probs[probs > 0] # 0*log(0) = 0 by convention
return -np.sum(probs * np.log(probs)) / np.log(base)
# ---- Case 1: Deterministic (no uncertainty) ----
print(f"Deterministic: {entropy([1.0, 0.0, 0.0]):.3f} bits")
# 0.000 bits -- knowing the outcome is FREE (no encoding needed)
# ---- Case 2: Fair coin (maximum 2-outcome entropy) ----
print(f"Fair coin: {entropy([0.5, 0.5], base=2):.3f} bits")
# 1.000 bit -- need exactly 1 bit to transmit the outcome
# ---- Case 3: Biased coin ----
print(f"Biased (0.9): {entropy([0.9, 0.1], base=2):.3f} bits")
# 0.469 bits -- predictable, so little info per flip
# ---- Case 4: Uniform over vocabulary ----
V = 50257 # GPT-2 tokenizer vocab size
print(f"Uniform vocab: {entropy([1/V]*V, base=2):.2f} bits")
# 15.61 bits -- a random model needs 15.6 bits per token
# ---- Entropy of a realistic LM output distribution ----
np.random.seed(42)
logits = np.random.randn(V)
def softmax(x): e=np.exp(x-x.max()); return e/e.sum()
probs = softmax(logits)
print(f"Random logits: {entropy(probs, base=2):.2f} bits")
# ~15.5 bits -- still near-uniform after random init
# Push one logit high to simulate a confident model
logits[42] += 10.0
probs_conf = softmax(logits)
print(f"Confident model:{entropy(probs_conf, base=2):.2f} bits")
# ~0.06 bits -- nearly certain, very low entropyProperties of Entropy
Cross-entropy H(P,Q) measures the average number of bits needed to encode events from distribution P when we use a code optimised for distribution Q. When our model Q differs from the true distribution P, we spend extra bits — exactly the KL divergence worth.
Definition
Key inequality: H(P,Q) ≥ H(P) always, with equality iff P = Q. The gap is the KL divergence. Training minimises this gap.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.special import entr
P = np.array([0.4, 0.3, 0.2, 0.1]) # true distribution
Q = np.array([0.25, 0.25, 0.25, 0.25]) # uniform model
# ---- Method 1: from definition ----
H_PQ_def = -np.sum(P * np.log(Q))
# ---- Method 2: H(P) + KL(P||Q) ----
H_P = -np.sum(P * np.log(P))
KL_PQ = np.sum(P * np.log(P / Q))
H_PQ_comp = H_P + KL_PQ
# ---- Method 3: torch.nn.functional (one-hot targets) ----
logits_q = torch.tensor(np.log(Q), dtype=torch.float32)
sample = torch.tensor([0, 1, 2, 3], dtype=torch.long) # simulate 4 samples
weights = torch.tensor(P, dtype=torch.float32)
ce_torch = (weights * F.cross_entropy(logits_q.unsqueeze(0).expand(4,-1), sample, reduction='none')).sum()
print(f"H(P,Q) definition: {H_PQ_def:.4f} nats")
print(f"H(P,Q) = H(P)+KL: {H_PQ_comp:.4f} nats")
print(f"H(P) = {H_P:.4f} nats (irreducible floor)")
print(f"KL gap = {KL_PQ:.4f} nats (what training closes)")One-Hot Targets: Why NLL Equals Cross-Entropy
In classification, the “true” distribution P is a one-hot vector: all mass on the correct class. This simplifies cross-entropy dramatically, because the sum collapses to a single term:
import numpy as np
def cross_entropy_onehot(logits, true_class):
"""Cross-entropy with one-hot P = NLL of true class."""
# Numerically stable softmax
logits = logits - logits.max()
probs = np.exp(logits) / np.exp(logits).sum()
# Cross-entropy: -sum_c P(c) log Q(c)
# With one-hot P: only the true class contributes
ce = -np.log(probs[true_class] + 1e-12) # == NLL
return ce, probs
logits = np.array([1.0, 3.0, 2.0, 0.5])
for true_class in range(4):
ce, probs = cross_entropy_onehot(logits, true_class)
print(f"True={true_class} P(correct)={probs[true_class]:.3f} CE={ce:.3f} nats")
# True=0 P(correct)=0.090 CE=2.411 (hard: model favours class 1)
# True=1 P(correct)=0.665 CE=0.408 (easy: model correctly confident)
# True=2 P(correct)=0.245 CE=1.408 (medium)
# True=3 P(correct)=0.049 CE=3.016 (hard: model very wrong)
# Note: lower CE → model is more confident in the right answerThe Gradient of Cross-Entropy w.r.t. Logits
One reason cross-entropy is so popular: its gradient w.r.t. the pre-softmax logits has an elegant closed form that is numerically stable and computationally efficient.
Where p_c = softmax(z)_c and y_c is the one-hot target. The gradient is simply the difference between the model’s prediction and the truth. This is the most elegant result in the chapter.
import numpy as np
def softmax(z): e=np.exp(z-z.max()); return e/e.sum()
def ce_and_grad(z, y_class):
"""Cross-entropy loss and its gradient w.r.t. logits z."""
p = softmax(z)
loss = -np.log(p[y_class] + 1e-12)
# Gradient: p - one_hot(y)
grad = p.copy()
grad[y_class] -= 1.0 # subtract 1 from correct class
return loss, grad
z = np.array([1.0, 3.0, 2.0, 0.5]) # logits
y = 2 # true class
loss, grad_analytic = ce_and_grad(z, y)
# Verify against numerical gradient
h = 1e-5
grad_numerical = np.zeros_like(z)
for i in range(len(z)):
zp, zm = z.copy(), z.copy()
zp[i] += h; zm[i] -= h
lp = -np.log(softmax(zp)[y]+1e-12)
lm = -np.log(softmax(zm)[y]+1e-12)
grad_numerical[i] = (lp - lm) / (2*h)
print(f"Analytic grad: {np.round(grad_analytic, 4)}")
print(f"Numerical grad: {np.round(grad_numerical, 4)}")
print(np.allclose(grad_analytic, grad_numerical, atol=1e-7))
# True -- gradient is exactly p - y_onehotWe met KL divergence in Chapter 3. Here we examine it more deeply: its geometric interpretation, its asymmetry explained carefully, and its central role in variational inference, RLHF, and contrastive learning.
Definition and Decomposition
import numpy as np
def kl_div(P, Q, eps=1e-12):
"""KL(P || Q): extra bits when using Q to encode messages from P."""
P, Q = np.asarray(P)+eps, np.asarray(Q)+eps
P /= P.sum(); Q /= Q.sum()
return float(np.sum(P * np.log(P/Q)))
# ---- KL is always non-negative ----
for _ in range(10000):
P = np.random.dirichlet(np.ones(5))
Q = np.random.dirichlet(np.ones(5))
assert kl_div(P, Q) >= -1e-9 # always >= 0
print("KL ≥ 0 verified over 10,000 random pairs")
# ---- KL is NOT symmetric ----
P = [0.9, 0.1] # peaked (true posterior in Bayesian setting)
Q = [0.5, 0.5] # flat (prior)
print(f"KL(P||Q) = {kl_div(P,Q):.4f}")
print(f"KL(Q||P) = {kl_div(Q,P):.4f}")
# KL(P||Q) = 0.5108 KL(Q||P) = 0.4185 -- different!
# ---- KL = 0 iff P == Q ----
print(f"KL(P||P) = {kl_div(P,P):.6f}")
# 0.000000Forward vs. Reverse KL: Mode-Seeking vs. Mean-Seeking
The two directions of KL have fundamentally different behaviours when Q is used to approximate P. This is one of the most practically important distinctions in probabilistic ML.
import numpy as np
import matplotlib.pyplot as plt
# True distribution P: bimodal (two peaks at 2 and 8)
x = np.linspace(0, 10, 200)
P = 0.5*np.exp(-0.5*(x-2)**2) + 0.5*np.exp(-0.5*(x-8)**2)
P /= P.sum()
# Q: a single Gaussian parameterised by mu
def Q_gaussian(mu, sigma=1.0):
q = np.exp(-0.5*((x-mu)/sigma)**2)
return q / q.sum()
def kl(P, Q, eps=1e-12): return float(np.sum(P*np.log((P+eps)/(Q+eps))))
mus = np.linspace(0.5, 9.5, 100)
fwd = [kl(P, Q_gaussian(m)) for m in mus] # KL(P||Q): forward
rev = [kl(Q_gaussian(m), P) for m in mus] # KL(Q||P): reverse
fwd_opt = mus[np.argmin(fwd)]
rev_opt = mus[np.argmin(rev)]
print(f"Forward KL minimised at mu={fwd_opt:.1f}")
# mu ≈5.0 -- mean-seeking: splits the difference between modes
print(f"Reverse KL minimised at mu={rev_opt:.1f}")
# mu ≈2.0 or 8.0 -- mode-seeking: locks onto one modeJensen’s inequality is a two-line theorem with outsized consequences: it proves KL ≥ 0, justifies the ELBO lower bound in VAEs, and explains why the log of an average is always greater than the average of logs.
Statement
Geometric interpretation: the chord connecting any two points on a convex function lies above (or on) the curve. Therefore the average of function values exceeds the function of the average.
import numpy as np
np.random.seed(0)
# ---- Verify for convex f(x) = x^2 ----
X = np.random.randn(10000)
f = lambda x: x**2
lhs = f(X.mean()) # f(E[X])
rhs = np.mean(f(X)) # E[f(X)]
print(f"f(E[X]) = {lhs:.4f} E[f(X)] = {rhs:.4f} LHS <= RHS: {lhs <= rhs+1e-9}")
# f(E[X]) ≈0.0 E[f(X)] ≈1.0 LHS <= RHS: True (Jensen holds)
# ---- log is concave: E[log X] ≤ log(E[X]) ----
X_pos = np.abs(X) + 0.1 # keep positive for log
lhs_log= np.log(X_pos.mean()) # log(E[X]) (concave: this is BIGGER)
rhs_log= np.mean(np.log(X_pos)) # E[log X] (Jensen: this is SMALLER)
print(f"log(E[X]) = {lhs_log:.4f} E[log X] = {rhs_log:.4f} log(E) >= E[log]: {lhs_log >= rhs_log-1e-9}")
# ---- Prove KL >= 0 using Jensen ----
# -KL(P||Q) = E_{P}[log(Q/P)] = E_{P}[log r] where r = Q/P
# By Jensen (log concave): E[log r] ≤ log E[r] = log Σ P(x)*(Q(x)/P(x))
# = log Σ Q(x) = log 1 = 0
# Therefore -KL ≤ 0 ⇒ KL ≥ 0 ✔
P = np.array([0.1,0.3,0.4,0.2])
Q = np.array([0.2,0.2,0.3,0.3])
r = Q / P # ratio
e_log_r = np.sum(P * np.log(r)) # E_P[log r] = -KL(P||Q)
log_e_r = np.log(np.sum(P * r)) # log(E_P[r]) = log(1) = 0
print(f"E[log r] = {e_log_r:.4f} ≤ log(E[r]) = {log_e_r:.4f} ⇒ -KL ≤ 0 ⇒ KL ≥ 0")The ELBO: Jensen Applied to Variational Inference
The Evidence Lower Bound (ELBO) in Variational Autoencoders follows directly from Jensen’s inequality applied to the log-evidence. This is the derivation every VAE paper references but rarely shows fully.
The first step uses Jensen’s inequality (log is concave, so log E[·] ≥ E[log ·]). The ELBO lower-bounds the log-likelihood; maximizing the ELBO simultaneously learns the encoder q(z|x) and decoder P(x|z).
import torch, torch.nn as nn, torch.nn.functional as F
class SimpleVAE(nn.Module):
def __init__(self, input_dim=784, latent_dim=20):
super().__init__()
self.encoder_mu = nn.Linear(input_dim, latent_dim)
self.encoder_lv = nn.Linear(input_dim, latent_dim) # log-variance
self.decoder = nn.Linear(latent_dim, input_dim)
def forward(self, x):
# Encode: q(z|x) = N(mu, diag(exp(log_var)))
mu, log_var = self.encoder_mu(x), self.encoder_lv(x)
# Reparameterisation trick
std = torch.exp(0.5 * log_var)
eps = torch.randn_like(std)
z = mu + eps * std
# Decode: P(x|z)
x_hat = torch.sigmoid(self.decoder(z))
# ELBO = reconstruction_term - KL_term
recon_loss = F.binary_cross_entropy(x_hat, x, reduction='sum') # -E[log P(x|z)]
kl_loss = -0.5 * torch.sum(1 + log_var - mu**2 - log_var.exp()) # KL(q||p)
elbo = -(recon_loss + kl_loss) # maximise ELBO = minimise sum
return x_hat, recon_loss, kl_lossMutual information (MI) quantifies the reduction in uncertainty about X when we observe Y. It is symmetric, always non-negative, and zero if and only if X and Y are independent. It appears in contrastive learning, feature selection, and self-supervised representation objectives.
Definition and Decompositions
The KL form reveals the geometric meaning: MI measures how different the joint distribution is from the product of marginals. Zero mutual information means the joint factorises — the variables are independent.
import numpy as np
def mutual_information(joint, eps=1e-12):
"""
Compute I(X;Y) from a joint probability table.
joint: 2D numpy array where joint[i,j] = P(X=i, Y=j).
"""
joint = np.asarray(joint, float)
joint = joint / joint.sum() # normalise
px = joint.sum(axis=1, keepdims=True)
py = joint.sum(axis=0, keepdims=True)
product = px * py # P(X)*P(Y)
# I(X;Y) = KL(P_XY || P_X P_Y)
mask = joint > eps
return float(np.sum(joint[mask] * np.log(joint[mask]/product[mask])))
# ---- Independent variables: I = 0 ----
px_ind = np.array([0.4, 0.6])
py_ind = np.array([0.3, 0.7])
joint_ind = np.outer(px_ind, py_ind) # factorised
print(f"Independent: I(X;Y) = {mutual_information(joint_ind):.4f} nats")
# ≈0.0000 -- independent, zero MI
# ---- Correlated variables: I > 0 ----
joint_cor = np.array([[0.4, 0.05], [0.05, 0.50]])
print(f"Correlated: I(X;Y) = {mutual_information(joint_cor):.4f} nats")
# 0.5004 -- knowing one strongly predicts the other
# ---- Deterministic: I = H(X) ----
joint_det = np.array([[0.5, 0.0], [0.0, 0.5]]) # Y always = X
hx_det = -np.sum([0.5,0.5]*np.log([0.5,0.5]))
print(f"Deterministic: I = {mutual_information(joint_det):.4f} nats H(X) = {hx_det:.4f}")
# I = H(X) when Y is a deterministic function of XMutual Information in Contrastive Learning
Self-supervised representation learning methods like CLIP, SimCLR, and MoCo maximise a lower bound on mutual information between two views of the same data. The InfoNCE loss is this lower bound.
import torch
import torch.nn.functional as F
def info_nce_loss(z1, z2, temperature=0.07):
"""
InfoNCE (NT-Xent) loss: lower bound on I(z1; z2).
z1, z2: L2-normalised embeddings (N, D).
Positive pairs: (z1[i], z2[i]). Negatives: all other pairs.
"""
N = z1.shape[0]
# L2 normalise
z1 = F.normalize(z1, dim=1)
z2 = F.normalize(z2, dim=1)
# Cosine similarity matrix: (N, N)
logits = (z1 @ z2.T) / temperature # (N, N)
# Positives on diagonal; negatives are off-diagonal
labels = torch.arange(N, device=z1.device)
# Cross-entropy: each row is a N-way classification problem
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
return loss
# Simulate: 64 samples, 128-dim embeddings
torch.manual_seed(0)
z1 = torch.randn(64, 128) # view 1 embeddings
z2 = z1 + 0.1 * torch.randn(64, 128) # view 2 (similar)
z2_rand = torch.randn(64, 128) # random view (unrelated)
print(f"InfoNCE (similar views): {info_nce_loss(z1, z2):.4f}")
print(f"InfoNCE (random views): {info_nce_loss(z1, z2_rand):.4f}")
# Similar views: low loss (high MI)
# Random views: log(N) = log(64) ≈4.16 (chance-level, zero MI)Conditional entropy H(X|Y) measures the remaining uncertainty in X after observing Y. The data processing inequality formalises a crucial fact: any deterministic transformation of data can only lose information, never create it.
Conditional Entropy
H(X|Y) ≤ H(X) always — conditioning reduces entropy (or leaves it unchanged). Equality holds iff X and Y are independent.
import numpy as np
def conditional_entropy(joint, eps=1e-12):
"""H(X|Y) from a joint probability table P(X,Y)."""
joint = np.asarray(joint, float); joint /= joint.sum()
py = joint.sum(axis=0, keepdims=True) # marginal P(Y)
cond = joint / (py + eps) # P(X|Y)
# H(X|Y) = -sum_{x,y} P(x,y) log P(x|y)
mask = joint > eps
return float(-np.sum(joint[mask] * np.log(cond[mask])))
# Joint table: Weather (rows) x Mood (cols)
joint = np.array([[0.40, 0.10], [0.15, 0.35]])
H_X = -np.sum(joint.sum(1)*np.log(joint.sum(1)+1e-12)) # H(Weather)
H_Y = -np.sum(joint.sum(0)*np.log(joint.sum(0)+1e-12)) # H(Mood)
H_XY = -np.sum(joint*np.log(joint+1e-12)) # H(Weather, Mood)
H_X_Y = conditional_entropy(joint) # H(Weather|Mood)
H_Y_X = conditional_entropy(joint.T) # H(Mood|Weather)
print(f"H(X) = {H_X:.4f}")
print(f"H(X|Y) = {H_X_Y:.4f} (reduced by observing Y)")
print(f"Chain rule: H(X|Y) = H(X,Y) - H(Y) = {H_XY-H_Y:.4f} ✔")
print(f"MI: I(X;Y) = H(X) - H(X|Y) = {H_X - H_X_Y:.4f}")Data Processing Inequality
If X → Y → Z forms a Markov chain (Z depends on X only through Y), then I(X;Z) ≤ I(X;Y). Processing data can only reduce the mutual information.
import torch
import torch.nn as nn
# Toy example: information about class label Y in representations
# X → layer1 → layer2 → layer3
# By DPI: I(Y; layer3) ≤ I(Y; layer2) ≤ I(Y; layer1) ≤ I(Y; X)
torch.manual_seed(42)
N = 1000
# 2-class input: class 0 ~ N(-1,1), class 1 ~ N(+1,1)
labels = torch.randint(0, 2, (N,))
X = (labels.float() * 2 - 1) + 0.8 * torch.randn(N, 1)
# Pass through layers of random (untrained) linear projections
layer1 = nn.Linear(1, 4, bias=False)
layer2 = nn.Linear(4, 16, bias=False)
layer3 = nn.Linear(16, 64, bias=False)
with torch.no_grad():
h1 = layer1(X)
h2 = layer2(h1)
h3 = layer3(h2)
# Estimate MI via a linear probe accuracy proxy
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
for name, rep in [('X',X), ('h1',h1),('h2',h2),('h3',h3)]:
acc = cross_val_score(LogisticRegression(), rep.numpy(), labels.numpy(), cv=3).mean()
print(f"{name:>4}: linear probe accuracy = {acc:.3f}")
# All ≈ same -- random projections don't lose class info (linear maps are invertible)
# But nonlinear activations with ReLU would reduce MI in early layersWe now have all the pieces to see language model training as an information-theoretic problem — not just an engineering task. This section ties together entropy, cross-entropy, KL divergence, and mutual information into a single coherent picture.
The Optimal Language Model
The data distribution P_data assigns probabilities to all possible sequences. The language model Q_θ is our parameterised approximation. The training objective is:
Since H(P_data) is fixed, minimising cross-entropy is exactly minimising KL divergence — bringing the model distribution as close as possible to the data distribution.
import torch
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2').eval()
tok = GPT2Tokenizer.from_pretrained('gpt2')
def analyse_text(text):
ids = tok.encode(text, return_tensors='pt')
with torch.no_grad():
logits = model(ids).logits[0] # (T, V)
probs = torch.softmax(logits, dim=-1).numpy()
tokens = [tok.decode([t]) for t in ids[0]]
# For each position (except last):
# surprisal, entropy, KL from uniform
results = []
for i in range(1, len(tokens)):
p = probs[i-1] # model dist at position i-1
true_id = ids[0, i].item()
surpris = -np.log(p[true_id]) # -log P(true token)
h = -np.sum(p * np.log(p + 1e-12)) # H[model at this step]
results.append((tokens[i], surpris, h))
avg_surpris = np.mean([r[1] for r in results])
perplexity = np.exp(avg_surpris)
avg_entropy = np.mean([r[2] for r in results])
print(f"Text: {text[:60]}...")
print(f"Avg surprisal: {avg_surpris:.3f} nats Perplexity: {perplexity:.1f}")
print(f"Avg entropy: {avg_entropy:.3f} nats (model confidence per step)")
print(f"Top-5 most surprising tokens:")
for tok_text, surpris, h in sorted(results, key=lambda r: -r[1])[:5]:
print(f" '{tok_text}': {surpris:.2f} nats (H={h:.2f})")
analyse_text("The transformer architecture was introduced by Vaswani et al. in 2017.")Reference Card
Exercises
Exercises 1–9 are pen-and-paper; 10–16 require code.
Further reading: “A Mathematical Theory of Communication” (Shannon, 1948) — the original paper, surprisingly readable. “Elements of Information Theory” (Cover & Thomas, 2006) — the standard textbook, Chapters 1–5. 3Blue1Brown’s video “But what is information?” for geometric intuition.
Next: Chapter 5 — Numerical Methods & Stability. We bridge theory and practice: how floating-point arithmetic, log-sum-exp tricks, and mixed precision interact with everything we’ve learned.