Solutions Appendix
Chapter 21

Evaluation During Pretraining

18 Solutions

Detailed solutions for the exercises in Chapter 21. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Define perplexity via cross-entropy; loss 2.3 nats → perplexity? Interpret.

Solution

Perplexity = exp(cross-entropy) = e^{2.3} ≈ 9.97 ≈ 10. It means the model is, on average, as uncertain as if choosing uniformly among ~10 equally-likely next tokens — its effective branching factor. Lower perplexity indicates sharper, more accurate next-token predictions.

Exercise 2 (Pen & Paper)
Why is perplexity not comparable across tokenizers? What normalization fixes it?

Solution

Perplexity is a per-token quantity, but different tokenizers split the same text into different numbers of tokens, so the same total likelihood is averaged over different counts and the resulting per-token perplexities are not comparable. Normalizing to bits-per-byte (or bits-per-character), that is, dividing the total negative log-likelihood, converted from nats to bits, by the number of bytes or characters rather than tokens, makes the measure tokenizer-independent and comparable across models.

Exercise 3 (Pen & Paper)
Four capabilities low perplexity does NOT guarantee; why prediction quality and each can diverge.

Solution

(1) Factual accuracy — fluent text can be confidently false. (2) Reasoning — predicting tokens well ≠ multi-step logical correctness. (3) Instruction-following — a base model predicts text but may not obey commands. (4) Calibration/honesty — low perplexity says nothing about knowing what it doesn't know. Perplexity measures average next-token prediction, which can be excellent while these higher-level behaviors fail, because they depend on more than local predictability.

Exercise 4 (Pen & Paper)
Describe log-likelihood multiple-choice scoring; why do longer answers score lower, and how does normalization fix it?

Solution

Each answer choice is scored by the model's total log-probability of its tokens given the question; the highest-scoring choice is selected. Longer answers multiply more (sub-1) probabilities, so their raw summed log-likelihood is lower simply for being longer — biasing toward short answers. Per-token (length) normalization — dividing by the number of tokens — removes this length bias, comparing average per-token likelihood instead.

Exercise 5 (Pen & Paper)
Zero-shot vs 5-shot; why must a benchmark number state the shot count?

Solution

Zero-shot gives only the question; 5-shot prepends 5 worked examples in the prompt (in-context learning), which usually raises accuracy substantially, especially for base models that infer the task format from the examples. Because the same model scores very differently under different shot counts, a benchmark number is meaningless without stating the shots — comparisons are only valid at matched shot counts.

Exercise 6 (Pen & Paper)
Trace the contamination pathway; why does n-gram decontamination miss paraphrases?

Solution

A published benchmark's questions get scraped into web crawls → included in pretraining data → the model memorizes them → it scores high by recall, not capability (inflated). N-gram decontamination removes training text sharing long exact n-grams with test items, but a PARAPHRASED question conveys the same content with different wording, sharing no long exact n-grams — so it slips through, and the model can still have effectively trained on the test.
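
The blind spot can be seen in a minimal sketch. The whitespace tokenization, the window size n=8, the function names, and the example strings are all illustrative assumptions, not a description of any particular decontamination pipeline.

```python
def ngrams(text, n):
    """Set of word-level n-grams (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, test_item, n=8):
    """Flag a training document that shares any exact n-gram with a test item."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

# Hypothetical test question and two hypothetical training documents.
test_item = "What is the boiling point of water at sea level in degrees Celsius?"
verbatim = "Quiz: What is the boiling point of water at sea level in degrees Celsius? Answer: 100"
paraphrase = "At sea level, water starts to boil at how many degrees on the Celsius scale? Answer: 100"

caught_verbatim = is_contaminated(verbatim, test_item)      # exact copy shares 8-grams
caught_paraphrase = is_contaminated(paraphrase, test_item)  # same content, no shared 8-gram
print(caught_verbatim, caught_paraphrase)
```

The verbatim copy is flagged, while the paraphrase passes the filter even though it leaks the same question and answer.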

Exercise 7 (Pen & Paper)
Define Expected Calibration Error; sketch reliability diagrams for well-calibrated vs overconfident.

Solution

ECE bins predictions by confidence and averages |confidence − accuracy| over bins (weighted by bin size) — the average gap between how sure the model is and how often it's right. A well-calibrated model's reliability diagram (accuracy vs confidence) hugs the diagonal; an overconfident model's curve sits BELOW the diagonal (high confidence, lower accuracy), bowing away from it.
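
Written as a formula, the verbal definition above reads as follows. The symbols are notation assumed here, not taken from the chapter: N predictions are split into M confidence bins B_m, acc(B_m) is the fraction of correct predictions in bin m, and conf(B_m) is the mean confidence in that bin.

```latex
\mathrm{ECE} \;=\; \sum_{m=1}^{M} \frac{|B_m|}{N}\,
  \bigl|\,\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\,\bigr|
```

For an overconfident model, conf(B_m) exceeds acc(B_m) in the high-confidence bins, which is what places the reliability curve below the diagonal.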

Exercise 8 (Pen & Paper)
Why are base models often well-calibrated while RLHF degrades calibration? Implication for evaluation.

Solution

A base model trained purely on likelihood tends to report probabilities that match empirical frequencies (well-calibrated). RLHF optimizes for human-preferred, confident-sounding answers, which pushes the model toward overconfidence regardless of correctness — degrading calibration. Implication: base and aligned models should be evaluated differently; an aligned model's stated confidence is less trustworthy, so calibration must be measured separately rather than assumed.

Exercise 9 (Pen & Paper)
Explain Goodhart's law for LLM benchmarks; example of optimizing MMLU corrupting its validity.

Solution

Goodhart: 'when a measure becomes a target, it ceases to be a good measure.' If developers optimize directly for MMLU — training on MMLU-like questions, tuning prompts to its format, or leaking its data — the score rises without the underlying knowledge improving, so MMLU no longer measures general capability. The benchmark becomes gamed: a high score reflects optimization to the test, not the broad competence it was meant to proxy.

Exercise 10 (Pen & Paper)
Frame 'when to stop training' as an economic decision balancing training cost vs inference savings.

Solution

Each additional training token costs compute now but yields a better model or, if you shrink the model and train it longer to reach the same quality, one that is cheaper to serve. Past the Chinchilla (compute-optimal) point, continued training buys diminishing loss improvements, so the decision to stop weighs the marginal training cost against the lifetime inference savings of a better or smaller model (Chapter 16's Exercise 18). For heavily served models, training longer pays off; for rarely served ones, it doesn't. It is a total-cost-of-ownership calculation.
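
A back-of-the-envelope version of this trade-off is sketched below. It assumes a deliberately simple cost model (total cost = training cost + cost per query × number of queries), two options of equal quality, and made-up figures in arbitrary cost units. The function names are hypothetical.

```python
def total_cost(train_cost, cost_per_query, n_queries):
    """Lifetime cost under the simple linear model assumed here."""
    return train_cost + cost_per_query * n_queries

def break_even_queries(train_a, serve_a, train_b, serve_b):
    """Query volume at which option B (more training, cheaper serving)
    costs the same as option A (less training, pricier serving)."""
    if serve_a <= serve_b:
        raise ValueError("option B must be cheaper to serve than option A")
    return (train_b - train_a) / (serve_a - serve_b)

# Hypothetical numbers: B spends 50% more on training to halve serving cost.
option_a = dict(train_cost=1_000_000, cost_per_query=0.002)
option_b = dict(train_cost=1_500_000, cost_per_query=0.001)

n_star = break_even_queries(option_a["train_cost"], option_a["cost_per_query"],
                            option_b["train_cost"], option_b["cost_per_query"])
print(f"break-even at about {n_star:.3g} queries")
```

Below the break-even volume the extra training never pays for itself; above it, every further query widens option B's advantage.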

Exercise 11 (Code)
Implement perplexity; report token perplexity and bits-per-byte on a held-out corpus.

Solution

Compute mean negative log-likelihood per token (exponentiate for perplexity) and per byte (divide total NLL in bits by byte count for bits-per-byte). Reporting both shows perplexity for intuition and bits-per-byte for the tokenizer-independent comparison of Exercise 2.
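
A minimal sketch of both metrics follows. It assumes the per-token log-probabilities (natural log) have already been obtained from a model; the function name and the toy inputs are illustrative.

```python
import math

def perplexity_and_bpb(token_logprobs, text):
    """token_logprobs: natural-log probability the model assigned to each
    token of `text`. Returns (token perplexity, bits-per-byte)."""
    total_nll_nats = -sum(token_logprobs)
    n_tokens = len(token_logprobs)
    n_bytes = len(text.encode("utf-8"))
    ppl = math.exp(total_nll_nats / n_tokens)          # exp(mean NLL per token)
    bpb = (total_nll_nats / math.log(2)) / n_bytes     # nats -> bits, per byte
    return ppl, bpb

# Toy input: a 12-byte string split into 4 tokens, each given probability 0.1.
text = "hello world!"
logprobs = [math.log(0.1)] * 4
ppl, bpb = perplexity_and_bpb(logprobs, text)
print(f"perplexity={ppl:.2f}  bits-per-byte={bpb:.3f}")
```

A tokenizer that split the same string into 6 tokens would change the perplexity for the same total likelihood, while bits-per-byte would stay the same because the byte count does not depend on the tokenizer.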

Exercise 12 (Code)
Implement multiple-choice scoring with and without per-token normalization; compare on HellaSwag.

Solution

Scoring each ending by total vs per-token log-likelihood and measuring accuracy on a HellaSwag subset shows length normalization usually improves accuracy by removing the bias toward shorter endings (Exercise 4) — a concrete demonstration of why normalization matters for fair scoring.
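
The two scoring rules can be sketched as below. The per-token log-probabilities are made-up toy values standing in for model outputs, and `pick_answer` is a hypothetical helper, not part of any evaluation library.

```python
def pick_answer(choice_logprobs, normalize=False):
    """choice_logprobs: one list of per-token log-probs per answer choice.
    Returns the index of the highest-scoring choice."""
    scores = []
    for lps in choice_logprobs:
        total = sum(lps)
        # Per-token normalization compares average rather than summed likelihood.
        scores.append(total / len(lps) if normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy case: the long ending is more likely per token but has a lower sum.
choices = [
    [-1.0, -1.0],                      # short: sum -2.0, mean -1.0
    [-0.5, -0.5, -0.5, -0.5, -0.5],    # long:  sum -2.5, mean -0.5
]
raw_pick = pick_answer(choices)                    # raw sum favors the short ending
norm_pick = pick_answer(choices, normalize=True)   # normalization favors the long one
print(raw_pick, norm_pick)
```

Running both rules over a benchmark subset and comparing accuracies gives the comparison the exercise asks for.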

Exercise 13 (Code)
Implement zero-shot and 5-shot evaluation; measure the accuracy gap; discuss why few-shot helps.

Solution

Running both regimes on a multiple-choice task typically shows 5-shot beating zero-shot, especially for base models, because the in-context examples teach the task format and prime the model (Exercise 5). The measured gap quantifies the value of in-context learning.
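
The only difference between the two regimes is how the prompt is assembled, which a small builder makes explicit. The "Question:/Answer:" template and the example items are assumptions for illustration; real benchmarks define their own formats.

```python
def build_prompt(question, examples=(), k=0):
    """Prepend the first k solved examples to the question (k=0 is zero-shot)."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical solved examples used as in-context demonstrations.
examples = [
    ("2 + 2 = ?", "4"),
    ("3 + 5 = ?", "8"),
    ("7 - 4 = ?", "3"),
    ("6 + 1 = ?", "7"),
    ("9 - 2 = ?", "7"),
]

zero_shot = build_prompt("5 + 3 = ?")
five_shot = build_prompt("5 + 3 = ?", examples, k=5)
print(five_shot)
```

Scoring the same questions with both prompts, using the same model and scoring rule, isolates the effect of the in-context examples.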

Exercise 14 (Code Lab)
Build an eval harness (validation perplexity + benchmark accuracy) and log it periodically in training.

Solution

Integrating periodic evaluation into the training loop and logging perplexity and benchmark accuracy lets you watch capabilities develop and catch regressions — the standard instrumentation for monitoring a pretraining run (complementing Chapter 15's dashboard).
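
One possible shape for such a harness is sketched below. The model, training step, and metric functions are stand-ins (a dict holding a decaying loss) so the loop can run on its own; in a real run they would be replaced by the actual model, optimizer step, and evaluation code.

```python
import math

def run_evals(model, eval_fns):
    """Run every registered evaluation and collect the results by name."""
    return {name: fn(model) for name, fn in eval_fns.items()}

def train_with_periodic_eval(model, train_step, eval_fns, total_steps, eval_every):
    """Train for total_steps, evaluating every eval_every steps and at the end."""
    history = []
    for step in range(1, total_steps + 1):
        train_step(model)
        if step % eval_every == 0 or step == total_steps:
            metrics = run_evals(model, eval_fns)
            history.append({"step": step, **metrics})
            print(step, metrics)   # stand-in for a real logger
    return history

# Stand-ins: the "model" is a dict whose loss decays by 1% per step.
model = {"loss": 3.0}

def fake_train_step(m):
    m["loss"] *= 0.99

eval_fns = {
    "val_perplexity": lambda m: math.exp(m["loss"]),
    "benchmark_acc": lambda m: min(1.0, 0.25 + (3.0 - m["loss"]) * 0.2),
}

history = train_with_periodic_eval(model, fake_train_step, eval_fns,
                                   total_steps=250, eval_every=100)
```

Keeping the evaluations in a name-to-function mapping means new benchmarks can be added without touching the training loop.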

Exercise 15 (Code)
Implement ECE and a reliability diagram; compute ECE for multiple-choice predictions; interpret.

Solution

Binning predictions by confidence, plotting accuracy vs confidence, and computing the weighted gap gives ECE and the reliability diagram of Exercise 7. A diagram bowing below the diagonal with high ECE indicates overconfidence — informing whether to apply calibration (temperature scaling) before trusting the probabilities.
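
A minimal sketch with equal-width bins follows. The choice of 10 bins and the toy confidences are assumptions; the returned (mean confidence, accuracy) pairs are the points one would plot for the reliability diagram.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binning over [0, 1]. Returns (ECE, diagram points), where
    each point is (mean confidence, accuracy) for one non-empty bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece, diagram = 0.0, []
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)   # gap weighted by bin size
        diagram.append((avg_conf, acc))
    return ece, diagram

# Toy predictions from an overconfident model: high confidence, 50% accuracy.
confs = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
hits = [True, True, False, False, True, False]
ece, diagram = expected_calibration_error(confs, hits)
print(f"ECE={ece:.3f}", diagram)
```

Every diagram point here has accuracy below its mean confidence, the below-the-diagonal pattern that signals overconfidence.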

Exercise 16 (Code)
Simulate contamination: train a tiny model with and without test questions; show inflated score.

Solution

Including the test items in training data lets the model memorize them, inflating its test score versus the clean model — a direct, reproducible demonstration of the contamination pathway in Exercise 6 and why decontamination is essential for honest evaluation.
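
The effect can be reproduced even without a neural network. The sketch below substitutes a lookup-table "model" that recalls memorized questions and guesses otherwise. This is an assumed stand-in for the tiny trained model the exercise asks for, chosen so the mechanism is visible in a few lines.

```python
import random

def make_questions(prefix, n, rng):
    """Synthetic 4-way multiple-choice items: (question id, correct choice)."""
    return [(f"{prefix}{i}", rng.randrange(4)) for i in range(n)]

class MemorizingModel:
    """Stand-in model: recalls answers it has seen, guesses on the rest."""
    def __init__(self, rng):
        self.memory = {}
        self.rng = rng

    def train(self, items):
        for question, answer in items:
            self.memory[question] = answer

    def answer(self, question):
        if question in self.memory:
            return self.memory[question]
        return self.rng.randrange(4)

def accuracy(model, items):
    return sum(model.answer(q) == a for q, a in items) / len(items)

rng = random.Random(0)
train_set = make_questions("train-", 1000, rng)
test_set = make_questions("test-", 1000, rng)

clean = MemorizingModel(rng)
clean.train(train_set)

contaminated = MemorizingModel(rng)
contaminated.train(train_set + test_set)   # test items leaked into training

clean_acc = accuracy(clean, test_set)
contaminated_acc = accuracy(contaminated, test_set)
print(f"clean={clean_acc:.2f}  contaminated={contaminated_acc:.2f}")
```

The clean model sits near the 25% chance level while the contaminated one scores perfectly, although neither has any capability beyond recall.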

Exercise 17 (Code)
Track metrics over training: log perplexity and a benchmark every K steps; find where benchmark gains plateau vs loss.

Solution

Plotting both curves often shows perplexity (loss) continuing to fall smoothly while benchmark accuracy plateaus earlier — illustrating that lower loss does not translate indefinitely into better task performance (Exercise 3), an important signal for deciding when further training stops being worthwhile.
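
One simple way to locate the plateau from logged metrics is sketched below. The window of 3 evaluations, the 0.005 gain threshold, and the logged values are made-up assumptions for illustration.

```python
def plateau_step(steps, values, window=3, min_gain=0.005):
    """Return the first logged step from which the metric gains less than
    min_gain over the next `window` evaluations, or None if it never stalls.
    Assumes higher is better; negate a loss before passing it in."""
    for i in range(len(values) - window):
        if values[i + window] - values[i] < min_gain:
            return steps[i]
    return None

# Hypothetical logs taken every 1000 steps.
steps = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]
benchmark_acc = [0.30, 0.38, 0.44, 0.47, 0.48, 0.481, 0.482, 0.482, 0.483]
val_loss = [3.2, 2.9, 2.7, 2.55, 2.45, 2.37, 2.30, 2.24, 2.19]

acc_plateau = plateau_step(steps, benchmark_acc)
loss_plateau = plateau_step(steps, [-v for v in val_loss])
print(acc_plateau, loss_plateau)
```

In this made-up log the benchmark stalls from step 5000 while the loss never meets the plateau criterion, the divergence the exercise asks you to find.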

Exercise 18 (Code, Challenge)
Build a mini eval suite (perplexity + 2 MC benchmarks + calibration); run on two checkpoints; recommend whether to continue training.

Solution

Running the suite on an early and a later checkpoint produces a comparison report: improving perplexity and benchmark accuracy with stable or improving calibration argues for continued training; flat benchmarks and degrading calibration argue for stopping. The recommendation should weigh the marginal gains against cost, making it the practical, evidence-based version of Exercise 10's economic framing.
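
The decision rule can be made explicit as a small function over two reports. The report layout, the thresholds, and the checkpoint numbers below are hypothetical, and the rule deliberately leaves out cost, which still has to be weighed separately.

```python
def recommend(early, late, min_acc_gain=0.01, max_ece_rise=0.02):
    """Compare two checkpoint reports shaped like
    {"perplexity": float, "benchmarks": {name: accuracy}, "ece": float}."""
    names = list(early["benchmarks"])
    mean_gain = sum(late["benchmarks"][n] - early["benchmarks"][n]
                    for n in names) / len(names)
    ppl_improved = late["perplexity"] < early["perplexity"]
    calibration_ok = late["ece"] - early["ece"] <= max_ece_rise
    if ppl_improved and mean_gain >= min_acc_gain and calibration_ok:
        return "continue"
    if mean_gain < min_acc_gain and not calibration_ok:
        return "stop"
    return "review"   # mixed signals: inspect the report by hand

# Hypothetical reports for three checkpoints.
ckpt_early = {"perplexity": 14.0, "benchmarks": {"mc_a": 0.41, "mc_b": 0.55}, "ece": 0.06}
ckpt_late = {"perplexity": 11.5, "benchmarks": {"mc_a": 0.46, "mc_b": 0.60}, "ece": 0.05}
ckpt_stalled = {"perplexity": 11.3, "benchmarks": {"mc_a": 0.46, "mc_b": 0.60}, "ece": 0.09}

first_call = recommend(ckpt_early, ckpt_late)
second_call = recommend(ckpt_late, ckpt_stalled)
print(first_call, second_call)
```

The "review" outcome covers mixed evidence, such as rising accuracy alongside worsening calibration, where a fixed rule should not decide alone.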