Information Theory
Detailed solutions for the exercises in Chapter 4. Try solving them yourself before checking the answers.
Solution
Surprisal = −log₂(1/6) = log₂ 6 ≈ 2.585 bits. One die roll therefore carries ~2.585 bits. To accumulate exactly 1 bit you need an event of probability 1/2 — only about 1/2.585 ≈ 0.39 of a die roll's worth of information; equivalently, one fair coin flip is 1 bit while one die roll exceeds it.
Solution
H = −Σ (1/N) log(1/N) = −N·(1/N)·(−log N) = log N. It is the maximum because, by Jensen/Lagrange, entropy is maximized by the uniform distribution: any concentration of probability reduces uncertainty. Uniform = maximal ignorance = maximal entropy.
Solution
Since H(P,Q) = H(P) + KL(P‖Q) and KL(P‖Q) ≥ 0 (Gibbs' inequality), it follows immediately that H(P,Q) ≥ H(P), with equality iff Q = P. Coding data with the wrong distribution Q can never beat using its true distribution P — the information-theoretic basis for why minimizing cross-entropy drives Q toward P.
Solution
CE = −Σ_j y_j log p_j with p = softmax(z). Using ∂p_j/∂z_c = p_j(δ_{jc} − p_c) and Σ_j y_j = 1, the sum telescopes to ∂CE/∂z_c = p_c − y_c. The gradient is predicted-minus-true probability — zero exactly when the prediction matches the one-hot target.
Solution
In bits: H(P) = −(0.5log₂.5 + 0.3log₂.3 + 0.2log₂.2) ≈ 1.485. H(P,Q) = −(0.5log₂.6 + 0.3log₂.3 + 0.2log₂.1) ≈ 1.554. KL = 0.5log₂(.5/.6) + 0 + 0.2log₂(.2/.1) ≈ −0.131 + 0.200 = 0.069. Indeed H(P)+KL = 1.485+0.069 = 1.554 = H(P,Q). ✓
Solution
Surprisal of 0.001 = −log₂ 0.001 ≈ 9.97 bits (6.91 nats). Mean NLL (nats) = 0.05·(−ln 0.001) + 0.95·(−ln 0.5) = 0.345 + 0.658 = 1.004, so perplexity = e^{1.004} ≈ 2.73. The rare confident errors raise perplexity noticeably above the e^{0.693}=2 baseline of the common case.
Solution
For a concave f, E[f(X)] ≤ f(E[X]); log is concave. Then −KL(P‖Q) = Σ p log(q/p) = E_P[log(q/p)] ≤ log E_P[q/p] = log Σ q = log 1 = 0. Hence KL(P‖Q) ≥ 0, with equality iff q/p is constant, i.e. P = Q.
Solution
Mutual information is I(X;Y) = KL(p(x,y) ‖ p(x)p(y)). Since KL ≥ 0 with equality iff its two arguments are equal, I(X;Y) = 0 exactly when p(x,y) = p(x)p(y) for all x,y — the definition of independence. So zero mutual information is precisely statistical independence.
Solution
The ELBO = reconstruction − β·KL(q(z|x)‖p(z)). The KL term pulls the encoder's posterior toward the prior, preventing it from memorizing each input as an isolated code and encouraging a smooth, structured latent space. As β→∞ the KL dominates, forcing q(z|x) → p(z) for all x — 'posterior collapse', where the latent ignores the input and reconstruction fails.
Solution
Use H(P)=−Σ p log p, H(P,Q)=−Σ p log q, KL=Σ p log(p/q), guarding against log(0). For random normalized P,Q these match scipy.stats.entropy(P) and entropy(P,Q) (which returns KL) to numerical precision, confirming the identity H(P,Q)=H(P)+KL holds elementwise.
Solution
For a fixed true label, the binary cross-entropy as a function of the predicted p is minimized when p matches the truth (e.g. p→1 for a positive label, where CE→0) and grows without bound as the prediction approaches the wrong extreme (CE→∞). The plot is the familiar convex log-loss curve.
Solution
Both are convex with a zero at Q=P, but they rise differently: forward KL(P‖Q) penalizes Q under-covering P's mass (blows up as Q→[0,1] where P has mass), while reverse KL(Q‖P) penalizes Q placing mass where P is small. The curves are not mirror images — visual proof that KL is not a distance.
Solution
Fluent, common text yields low average surprisal (predictable tokens); technical text is moderate (rarer vocabulary, specialized phrasing); random words are high (little predictability). The per-token surprisal profile spikes at rare or unexpected tokens, visualizing where the model is 'surprised'.
Solution
InfoNCE is cross-entropy over similarity scores where the positive pair is the correct class among N. With random embeddings every candidate is equally likely, so loss ≈ log N = log 128 ≈ 4.85. With perfect embeddings (z1=z2) the positive similarity dominates and loss → 0. This is the contrastive objective behind CLIP and SimCLR (Chapters 8, 30).
Solution
As training proceeds CE loss falls, predictive entropy falls (the model grows more confident), and perplexity = exp(CE) falls in lockstep. They are tightly related: perplexity is the exponential of cross-entropy, and confident correct predictions reduce both entropy and loss together.
Solution
Higher β weights the KL term more, producing a more disentangled but lower-capacity latent: reconstruction loss rises and the KL term shrinks (the posterior hugs the prior). At very high β the latent collapses (ignores the input). Low β gives sharp reconstructions but a less structured, less disentangled latent space — the core β-VAE trade-off.