Evaluation During Pretraining
Detailed solutions for the exercises in Chapter 21. Try solving them yourself before checking the answers.
Solution
Perplexity = exp(cross-entropy) = e^{2.3} ≈ 9.97 ≈ 10, taking the cross-entropy of 2.3 to be in nats per token. It means the model is, on average, as uncertain as if choosing uniformly among ~10 equally likely next tokens; this is its effective branching factor. Lower perplexity indicates sharper, more accurate next-token predictions.
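The arithmetic can be checked in a few lines. This is a minimal sketch that assumes the loss is a mean per-token cross-entropy in nats; the function names are illustrative, not from any library.

```python
import math

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(cross_entropy_nats)

def uniform_cross_entropy(k: int) -> float:
    """Cross-entropy (nats) of guessing uniformly among k tokens."""
    return math.log(k)

ppl = perplexity(2.3)  # about 9.97
# A uniform guess over 10 tokens has perplexity 10, the "branching factor" reading.
uniform_ppl = perplexity(uniform_cross_entropy(10))
print(round(ppl, 2), round(uniform_ppl, 2))
```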
Solution
Perplexity is per-token, but tokenizers split text into different numbers of tokens, so the same text has different token counts and thus incomparable per-token perplexities. Normalizing to bits-per-byte (or bits-per-character) — dividing total negative log-likelihood by the number of bytes/characters rather than tokens — makes the measure tokenizer-independent and comparable across models.
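A small sketch makes the point concrete. It assumes two hypothetical tokenizers that split the same text into 5 and 10 tokens, and a made-up total negative log-likelihood of 20 nats for both models. The per-token perplexities differ sharply while bits-per-byte is identical.

```python
import math

def perplexity(total_nll_nats: float, n_tokens: int) -> float:
    """Per-token perplexity from a total NLL in nats."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Total NLL converted to bits, divided by the UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

text = "Evaluation during pretraining."
total_nll = 20.0  # hypothetical: both models assign the text the same likelihood

ppl_a = perplexity(total_nll, n_tokens=5)   # hypothetical coarse tokenizer
ppl_b = perplexity(total_nll, n_tokens=10)  # hypothetical fine tokenizer
bpb = bits_per_byte(total_nll, text)        # the same for both models
print(ppl_a, ppl_b, bpb)
```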
Solution
(1) Factual accuracy — fluent text can be confidently false. (2) Reasoning — predicting tokens well ≠ multi-step logical correctness. (3) Instruction-following — a base model predicts text but may not obey commands. (4) Calibration/honesty — low perplexity says nothing about knowing what it doesn't know. Perplexity measures average next-token prediction, which can be excellent while these higher-level behaviors fail, because they depend on more than local predictability.
Solution
Each answer choice is scored by the model's total log-probability of its tokens given the question; the highest-scoring choice is selected. Longer answers multiply more (sub-1) probabilities, so their raw summed log-likelihood is lower simply for being longer — biasing toward short answers. Per-token (length) normalization — dividing by the number of tokens — removes this length bias, comparing average per-token likelihood instead.
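The length bias is easy to reproduce. The sketch below uses made-up per-token log-probabilities for a short wrong answer and a longer correct one. `pick_choice` is an illustrative helper, not a function from any evaluation library.

```python
def pick_choice(token_logprobs_per_choice, normalize=False):
    """Return the index of the best-scoring choice.

    Each choice is a list of per-token log-probabilities. With
    normalize=True the summed log-likelihood is divided by the token count.
    """
    scores = []
    for logprobs in token_logprobs_per_choice:
        total = sum(logprobs)
        scores.append(total / len(logprobs) if normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values: the long answer is more likely per token but has more tokens.
short_wrong = [-1.2, -1.3]                    # sum -2.5, mean -1.25
long_right = [-0.6, -0.7, -0.5, -0.8, -0.6]   # sum -3.2, mean -0.64
choices = [short_wrong, long_right]

raw_pick = pick_choice(choices)                    # raw sum favors the short answer
norm_pick = pick_choice(choices, normalize=True)   # per-token mean favors the long one
print(raw_pick, norm_pick)
```

Dividing by byte or character count instead of token count is a common variant of the same idea.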
Solution
Zero-shot gives only the question; 5-shot prepends 5 worked examples in the prompt (in-context learning), which usually raises accuracy substantially, especially for base models that infer the task format from the examples. Because the same model scores very differently under different shot counts, a benchmark number is meaningless without stating the shots — comparisons are only valid at matched shot counts.
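The two regimes differ only in how the prompt is assembled. A minimal sketch follows, assuming a "Question:/Answer:" template. That template is an illustrative choice, and real benchmarks each define their own format.

```python
def build_prompt(question, examples=(), k=0):
    """Build a k-shot prompt; k=0 gives the zero-shot prompt.

    `examples` is a sequence of (question, answer) pairs; the first k are
    prepended as worked examples before the question to be answered.
    """
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in list(examples)[:k]]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical worked examples.
examples = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 6)]

zero_shot = build_prompt("What is 7 + 7?")
five_shot = build_prompt("What is 7 + 7?", examples, k=5)
print(five_shot)
```

Recording `k` alongside every reported score keeps comparisons at matched shot counts.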
Solution
A published benchmark's questions get scraped into web crawls, the crawls are included in pretraining data, and the model memorizes the questions, so it scores high by recall rather than capability and the result is inflated. N-gram decontamination removes training text that shares long exact n-grams with test items, but a paraphrased question conveys the same content in different wording and shares no long exact n-grams, so it slips through, and the model can still have effectively trained on the test.
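The blind spot can be shown with a toy overlap check. This sketch assumes whitespace tokenization and a window of n = 8 words, both chosen for illustration. Production pipelines pick their own tokenization and n.

```python
def ngrams(text, n):
    """Set of word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, test_item, n=8):
    """Flag a training document that shares any exact n-gram with a test item."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

test_item = "What is the boiling point of water at sea level in degrees Celsius?"
verbatim = ("Quiz dump: What is the boiling point of water at sea level "
            "in degrees Celsius? Answer: 100")
paraphrase = ("At sea level, water starts to boil at how many degrees "
              "on the Celsius scale? Answer: 100")

caught = is_contaminated(verbatim, test_item)    # exact copy is flagged
missed = is_contaminated(paraphrase, test_item)  # paraphrase is not
print(caught, missed)
```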
Solution
ECE bins predictions by confidence and averages |confidence − accuracy| over bins (weighted by bin size) — the average gap between how sure the model is and how often it's right. A well-calibrated model's reliability diagram (accuracy vs confidence) hugs the diagonal; an overconfident model's curve sits BELOW the diagonal (high confidence, lower accuracy), bowing away from it.
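The definition translates directly into code. The sketch below assumes equal-width confidence bins and uses made-up predictions. It is one common way to compute ECE, not the only one.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted average of |mean confidence - accuracy| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical overconfident model: always 90% sure, right 60% of the time.
over_ece = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
# Hypothetical calibrated model: 80% sure, right 80% of the time.
good_ece = expected_calibration_error([0.8] * 5, [True] * 4 + [False])
print(over_ece, good_ece)
```

On a reliability diagram the first model's single point sits at (0.9, 0.6), below the diagonal, while the second sits on it.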
Solution
A base model trained purely on likelihood tends to report probabilities that match empirical frequencies (well-calibrated). RLHF optimizes for human-preferred, confident-sounding answers, which pushes the model toward overconfidence regardless of correctness — degrading calibration. Implication: base and aligned models should be evaluated differently; an aligned model's stated confidence is less trustworthy, so calibration must be measured separately rather than assumed.
Solution
Goodhart: 'when a measure becomes a target, it ceases to be a good measure.' If developers optimize directly for MMLU — training on MMLU-like questions, tuning prompts to its format, or leaking its data — the score rises without the underlying knowledge improving, so MMLU no longer measures general capability. The benchmark becomes gamed: a high score reflects optimization to the test, not the broad competence it was meant to proxy.
Solution
Each additional training token costs compute now but yields a better model, or a smaller model of equal quality that is cheaper to serve. Past the Chinchilla (compute-optimal) point, continued training buys diminishing loss improvements, so the decision to stop weighs the marginal training cost against the lifetime inference savings of a better or smaller model (Chapter 16's Exercise 18). For heavily served models, training longer pays off; for rarely served ones, it doesn't. It is a total-cost-of-ownership calculation.
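A rough back-of-the-envelope version of this trade-off is sketched below. It relies on the common approximations of about 6·N·D FLOPs to train N parameters on D tokens and about 2·N FLOPs per served token. The two configurations are hypothetical and are simply assumed to reach the same quality, which in practice would have to be measured.

```python
def total_flops(n_params, train_tokens, served_tokens):
    """Approximate lifetime compute: ~6*N*D for training plus ~2*N per served token."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

def break_even_served_tokens(big, small):
    """Served-token count at which the smaller, longer-trained model becomes cheaper."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2 * (n_big - n_small)
    return extra_training / saving_per_token

# Hypothetical configurations, ASSUMED (not measured) to be equally good.
big = (70e9, 1.4e12)   # larger model, fewer training tokens
small = (30e9, 6e12)   # smaller model, trained well past compute-optimal

break_even = break_even_served_tokens(big, small)
for served in (1e12, 1e14):
    cheaper = "small" if total_flops(*small, served) < total_flops(*big, served) else "big"
    print(f"served={served:.0e}: {cheaper} model is cheaper overall")
```

Below the break-even volume the extra training never pays for itself; above it, the smaller model wins on total cost.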
Solution
Compute mean negative log-likelihood per token (exponentiate for perplexity) and per byte (divide total NLL in bits by byte count for bits-per-byte). Reporting both shows perplexity for intuition and bits-per-byte for the tokenizer-independent comparison of Exercise 2.
Solution
Scoring each ending by total vs per-token log-likelihood and measuring accuracy on a HellaSwag subset shows length normalization usually improves accuracy by removing the bias toward shorter endings (Exercise 4) — a concrete demonstration of why normalization matters for fair scoring.
Solution
Running both regimes on a multiple-choice task typically shows 5-shot beating zero-shot, especially for base models, because the in-context examples teach the task format and prime the model (Exercise 5). The measured gap quantifies the value of in-context learning.
Solution
Integrating periodic evaluation into the training loop and logging perplexity and benchmark accuracy lets you watch capabilities develop and catch regressions — the standard instrumentation for monitoring a pretraining run (complementing Chapter 15's dashboard).
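One possible shape for that instrumentation is sketched below. The callables `train_step`, `evaluate`, and `log` are hypothetical stand-ins for whatever the training framework provides, and the metric values in the stub are placeholders rather than real results.

```python
def train_with_eval(num_steps, eval_every, train_step, evaluate, log):
    """Run training, evaluating every `eval_every` steps and at the final step."""
    history = []
    for step in range(1, num_steps + 1):
        train_step(step)
        if step % eval_every == 0 or step == num_steps:
            metrics = evaluate(step)  # e.g. {"perplexity": ..., "accuracy": ...}
            log(step, metrics)
            history.append((step, metrics))
    return history

# Stub callables so the sketch runs on its own; values are placeholders only.
def fake_train_step(step):
    pass

def fake_evaluate(step):
    return {"perplexity": 100.0 / step, "accuracy": min(1.0, step / 20)}

def print_log(step, metrics):
    print(step, metrics)

history = train_with_eval(10, 4, fake_train_step, fake_evaluate, print_log)
```

Keeping the returned history makes it straightforward to plot perplexity and accuracy against step and spot regressions between checkpoints.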
Solution
Binning predictions by confidence, plotting accuracy vs confidence, and computing the weighted gap gives ECE and the reliability diagram of Exercise 7. A diagram bowing below the diagonal with high ECE indicates overconfidence — informing whether to apply calibration (temperature scaling) before trusting the probabilities.
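Temperature scaling itself is a one-parameter fit: divide the logits by a scalar T chosen to minimize negative log-likelihood on held-out data. The sketch below uses a simple grid search and made-up binary logits. The grid range is an assumption, and real implementations usually optimize T with a gradient-based method.

```python
import math

def mean_nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    total = 0.0
    for logits, label in zip(logits_list, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)

def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature from a grid (0.5 to 5.0 by default) that minimizes NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]
    return min(grid, key=lambda t: mean_nll(logits_list, labels, t))

# Hypothetical held-out set: the model is ~98% confident in class 0 every time,
# but class 0 is correct only 3 times out of 4.
logits_list = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]

best_t = fit_temperature(logits_list, labels)
print(best_t)  # a value above 1 softens the overconfident probabilities
```

Because dividing by T does not change which logit is largest, accuracy is unaffected; only the reported confidences move.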
Solution
Including the test items in training data lets the model memorize them, inflating its test score versus the clean model — a direct, reproducible demonstration of the contamination pathway in Exercise 6 and why decontamination is essential for honest evaluation.
Solution
Plotting both curves often shows perplexity (loss) continuing to fall smoothly while benchmark accuracy plateaus earlier — illustrating that lower loss does not translate indefinitely into better task performance (Exercise 3), an important signal for deciding when further training stops being worthwhile.
Solution
Running the suite on an early and a later checkpoint produces a comparison report: improving perplexity and benchmark accuracy with stable or improving calibration argues for continued training; flat benchmarks and degrading calibration argue for stopping. The recommendation should weigh the marginal gains against cost, which is the practical, evidence-based version of Exercise 10's economic framing.