Probability & Statistics
Detailed solutions for the exercises in Chapter 3. Try solving them yourself before checking the answers.
Solution
P(both red) = (3/10)·(2/9) = 6/90 = 1/15 ≈ 0.0667. The second factor uses 2 reds left of 9 balls after one red is drawn (no replacement).
Solution
By Bayes: P(+) = 0.02·0.95 + 0.98·0.01 = 0.019 + 0.0098 = 0.0288. So P(spam|+) = 0.019/0.0288 ≈ 0.66. Even a 95%/99% test gives only ~66% confidence because spam is rare — the base-rate effect that makes false positives dominate for rare classes.
Solution
The log-likelihood for k successes in n trials is ℓ(p) = k·log p + (n−k)·log(1−p). Setting ℓ'(p) = k/p − (n−k)/(1−p) = 0 gives k(1−p) = (n−k)p, i.e. k = np, so p̂ = k/n. The MLE is simply the observed success frequency.
Solution
MAP maximizes log p(θ|data) = log p(data|θ) + log p(θ). For a Gaussian prior N(0,σ²I), −log p(θ) = ‖θ‖²/(2σ²) + const. Setting σ² = 1/(2λ) makes this term λ‖θ‖² — exactly L2 weight decay. So L2 regularization is MAP estimation under a zero-mean Gaussian prior; stronger regularization ⇔ tighter prior.
Solution
H([0.5,0.5]) = −2·0.5·log₂ 0.5 = 1 bit. H([0.9,0.1]) = −0.9log₂ 0.9 − 0.1log₂ 0.1 ≈ 0.137 + 0.332 = 0.469 bits. The uniform distribution has higher entropy because it is maximally uncertain; the skewed one is more predictable, carrying less information per sample.
Solution
KL = 0.4·log₂(0.4/0.5) + 0.6·log₂(0.6/0.5) = 0.4(−0.322) + 0.6(0.263) = −0.129 + 0.158 = 0.029 bits ≥ 0. The small positive value reflects the mild mismatch between P and Q; KL is zero only when P = Q.
Solution
KL(P‖Q) = Σ p log(p/q) = Σ p log p − Σ p log q. Adding H(P) = −Σ p log p cancels the first term, leaving −Σ p log q = H(P,Q). So cross-entropy = entropy + KL: the cost of coding P with Q's code is P's own entropy plus the penalty for the wrong code.
Solution
The per-token negative log-likelihood is −log(0.01). Perplexity is the geometric-mean inverse probability, so this token contributes a factor of 1/0.01 = 100 to the perplexity (equivalently NLL = 4.605 nats). A confidently wrong token sharply inflates perplexity.
Solution
As T→0, dividing logits by T amplifies differences without bound, so the largest logit's exponential dominates the denominator and its probability → 1 while all others → 0. The distribution collapses onto argmax — exactly greedy decoding. Conversely T→∞ flattens toward uniform.
Solution
Forward KL Σ p log(p/q) weights the mismatch by P; reverse KL weights by Q. Forward KL is 'mean-seeking' (penalizes Q putting low mass where P has mass — it covers all modes), used in maximum-likelihood training. Reverse KL is 'mode-seeking' (penalizes Q putting mass where P has none), used in variational inference and some RL objectives. Choosing the wrong direction changes which behaviour you get.
Solution
d/dp[p(1−p)] = 1−2p = 0 → p = 0.5, giving maximum variance 0.25. Variance is minimized (= 0) at p = 0 or p = 1, where the outcome is certain. Uncertainty, like entropy, peaks at the balanced point.
Solution
With p = softmax(z) and CE = −Σ_j y_j log p_j, the softmax Jacobian ∂p_j/∂z_i = p_j(δ_{ij}−p_i). Substituting and summing (using Σ y_j = 1) gives ∂CE/∂z_i = p_i − y_i. This clean form is why softmax + cross-entropy is the standard classification head: the gradient is just predicted minus target.
Solution
Estimate token probabilities by frequency, then perplexity = exp(mean NLL) on held-out text. A unigram model scores far better than a uniform random model (perplexity ≈ vocab size) because word frequencies are highly non-uniform, but far worse than models using context.
Solution
Estimate P(w_t | w_{t−1}) from adjacent-pair counts (with smoothing for unseen pairs), then sample the next token from that conditional repeatedly. The output is locally plausible but quickly loses coherence — the limit of a 1-token context window.
Solution
Greedy is deterministic and most repetitive (lowest distinct-n). Raising temperature and using top-k/top-p increases diversity (higher distinct-1, distinct-2) at some cost to coherence. top-p (nucleus) adapts the cutoff to the distribution's shape, usually giving the best diversity/quality balance — the standard default.
Solution
KL is convex in Q with a minimum of 0 at Q = P = [0.6,0.4]. As Q→[1,0] the term 0.4·log(0.4/0) → +∞, and similarly as Q→[0,1]. KL blows up whenever Q assigns near-zero probability to an outcome P deems possible — the 'infinite surprise' of ruling out something that happens.
Solution
Fit each degree on many bootstrap samples, then estimate bias² (squared gap of the mean prediction from truth) and variance (spread across fits). Low degrees show high bias, low variance (underfit); high degrees show low bias, high variance (overfit). Total error = bias² + variance + noise is U-shaped, minimized at an intermediate degree.
Solution
Temperature divides logits before softmax; top-k zeroes all but the k largest; top-p keeps the smallest set whose cumulative probability ≥ p, renormalizes, and samples. Edge cases: T=0.001 ≈ greedy; k=1 = greedy; p=0.01 keeps only the single most probable token (also ≈ greedy). All three degenerate to argmax in their extremes, confirming correctness.
Solution
Perplexity is lowest on fluent Wikipedia (the model predicts it well), higher on random English words (valid tokens but no coherent structure), and highest on random characters (off-distribution, near-uniform surprise). Perplexity tracks how 'expected' the text is under the model.
Solution
Estimate P(w_t | w_{t−2}, w_{t−1}) with add-k smoothing: (count + k)/(context_count + k·V). Sweep k (e.g. 0.01–1.0) and pick the value minimizing validation perplexity — a bias/variance trade-off: small k overfits sparse counts, large k over-smooths toward uniform. The optimum is typically small but nonzero.