Evaluation & Benchmarking
A frontier pretraining run takes weeks or months and costs millions of dollars. Launching it blindly and hoping for the best is not an option. Evaluation during pretraining turns the run from a black box into an instrument panel: it tells you whether the model is learning, whether something has gone wrong, and when there is no more to gain. This chapter — the close of Part IV — covers what to measure and how to read it.
Three Questions Evaluation Answers
The three questions raised above (is the model learning, has something gone wrong, and is there anything left to gain) map onto two complementary kinds of metric. INTRINSIC metrics (loss and perplexity) measure how well the model predicts text, are cheap to compute continuously, and track the training objective directly. EXTRINSIC metrics (downstream benchmarks) measure actual capabilities, are more expensive, and are run periodically. A good evaluation strategy uses both.
| Intrinsic metrics | Extrinsic metrics |
|---|---|
| Loss, perplexity | MMLU, HellaSwag, etc. |
| Measure prediction quality | Measure task capability |
| Cheap — run every step | Expensive — run periodically |
| Directly track the objective | Track what we actually care about |
| Smooth, predictable | Noisier, can jump |
| Can't be gamed by the model | Vulnerable to contamination |
Perplexity is the exponential of the average cross-entropy loss (Chapter 4). It is the primary intrinsic metric for language models: the standard, interpretable way to express how well a model predicts held-out text. Lower is better.
PPL = exp(L) = exp( -(1/T) Σₜ log P(xₜ | x_<ₜ) )

L   = average cross-entropy loss in nats
PPL = the effective branching factor: 'how many equally likely next tokens is the model choosing among?'

A perplexity of 1 means perfect prediction (the model always assigns probability 1 to the correct token). A perplexity equal to the vocabulary size means uniform random guessing. A well-trained LLM reaches single-digit to low-double-digit perplexity on typical English text: it is effectively choosing among only a handful of plausible next tokens at each step.
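To make the branching-factor reading concrete, the sketch below computes perplexity directly from the probabilities a model assigned to the correct tokens. The helper name `perplexity_from_probs`, the 50,000-token vocabulary, and the toy probabilities are illustrative assumptions, not taken from any library or real model.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity from the probability assigned to each correct token."""
    # Average negative log-likelihood in nats (the cross-entropy loss L).
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

# Perfect prediction: probability 1 on every correct token.
perfect = perplexity_from_probs([1.0, 1.0, 1.0])

# Uniform guessing over a hypothetical 50,000-token vocabulary.
vocab_size = 50_000
uniform = perplexity_from_probs([1.0 / vocab_size] * 4)

# Always 1-in-8 confident: behaves like choosing among 8 equally likely options.
one_in_eight = perplexity_from_probs([0.125] * 10)

print(perfect, round(uniform), round(one_in_eight, 6))
```

The three cases land on 1, the vocabulary size, and 8 respectively, which is the sense in which perplexity counts the options the model is effectively choosing among.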
    import math

    import torch
    import torch.nn.functional as F

    def perplexity(model, tokens):  # tokens: (B, T+1)
        """Perplexity on a batch of token sequences."""
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        with torch.no_grad():
            logits = model(inputs)  # (B, T, V)
            # Mean cross-entropy (in nats) over every predicted token
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets.reshape(-1),
            )
        return math.exp(loss.item())

    # Always report which dataset and tokenizer the perplexity is on:
    # PPL is only comparable across models using the SAME tokenizer and
    # the SAME evaluation text. A model with a bigger vocabulary will have
    # a different PPL on the same text -- it is NOT comparable across tokenizers.

Perplexity is essential but insufficient. It measures how well the model predicts text on average, but a model can predict text well while being poor at the tasks we actually care about. The relationship between perplexity and capability is real but loose, and the gaps matter.
What Perplexity Misses
Downstream benchmarks measure specific capabilities by posing tasks and scoring the model's answers. They are how the field tracks and compares model capability. There are hundreds; here are the most influential and what each probes.
| Benchmark | Measures | Format |
|---|---|---|
| MMLU | Broad knowledge (57 subjects) | 4-way multiple choice |
| HellaSwag | Commonsense sentence completion | 4-way multiple choice |
| ARC | Grade-school science reasoning | Multiple choice |
| WinoGrande | Coreference / commonsense | Binary choice |
| GSM8K | Grade-school math word problems | Free-form numeric answer |
| HumanEval | Python code generation | Functional unit tests |
| TruthfulQA | Resistance to false beliefs | Multiple choice / generation |
| BIG-Bench | 200+ diverse tasks | Mixed formats |
MMLU: The Knowledge Standard
MMLU (Massive Multitask Language Understanding; Hendrycks et al., 2021) is the most-cited knowledge benchmark: 57 subjects ranging from elementary mathematics to professional law, with every question posed as four-way multiple choice. It became the standard headline number for comparing models' breadth of knowledge. Random guessing scores 25%; strong models exceed 80%.
HellaSwag and Commonsense
HellaSwag (Zellers et al., 2019) tests commonsense by asking the model to choose the most plausible continuation of a scenario from four options. Three of them are machine-generated distractors selected by adversarial filtering: grammatical and fluent on the surface, but nonsensical to a human reader. It probes whether the model has a coherent world model, not just fluency.
How exactly do you score a base model on a multiple-choice question? The model does not 'pick' an answer — it assigns probabilities to token sequences. The standard method is log-likelihood scoring: present the question with each candidate answer, measure the model's probability of generating that answer, and choose the highest. The details of this scoring matter more than one might expect.
Log-Likelihood Scoring
    # For a question Q with candidate answers A1 ... Ak:
    for each candidate A_i:
        score_i = sum of log P(token | previous) over A_i's tokens,
                  given the context Q
    prediction = argmax_i score_i
    # Variants normalize by answer length or by the answer's prior probability

A crucial subtlety: longer answers have lower total log-likelihood simply because they have more tokens, each contributing negative log-probability. Different normalization schemes (raw sum, per-token average, or normalization by the answer's unconditional probability) can change which answer wins, and therefore the benchmark score. This is why the same model can score differently in different evaluation harnesses.
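The length effect described above is easy to reproduce with hand-picked numbers. In this sketch the per-token log-probabilities are invented for illustration and the helper `pick` is hypothetical; it only shows that a raw sum and a per-token average can select different answers from the same scores.

```python
def pick(candidates, normalize):
    """Return the index of the highest-scoring candidate.

    candidates: one list of per-token log-probabilities per answer.
    """
    scores = []
    for token_logps in candidates:
        total = sum(token_logps)
        scores.append(total / len(token_logps) if normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)

# Invented per-token log-probs for two candidate answers.
short_answer = [-2.0]                    # 1 token:  sum -2.0, mean -2.0
long_answer = [-0.9, -0.8, -0.7, -0.6]   # 4 tokens: sum -3.0, mean -0.75

raw_choice = pick([short_answer, long_answer], normalize=False)
norm_choice = pick([short_answer, long_answer], normalize=True)
print(raw_choice, norm_choice)
```

The raw sum prefers the short answer (index 0) even though every token of the long answer is individually more likely; the per-token average prefers the long answer (index 1).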
    import torch
    import torch.nn.functional as F

    def score_choice(model, context_ids, answer_ids):
        """Sum of log P(answer | context), token by token."""
        # context_ids and answer_ids are assumed to be 1-D LongTensors
        ids = torch.cat([context_ids, answer_ids])[None]  # (1, ctx+ans)
        with torch.no_grad():
            logits = model(ids[:, :-1])  # position j predicts token j+1
        logp = F.log_softmax(logits[0], dim=-1)
        # Sum log-prob over the ANSWER tokens only; the first answer token
        # is predicted from the last context position, hence the -1
        ans_start = len(context_ids) - 1
        total = 0.0
        for i, tok in enumerate(answer_ids):
            total += logp[ans_start + i, tok].item()
        return total

    def answer_question(model, context, choices, tokenizer, normalize=True):
        # tokenizer is assumed to map a string to a 1-D LongTensor of token ids
        ctx = tokenizer(context)
        scores = []
        for choice in choices:
            ans = tokenizer(choice)
            s = score_choice(model, ctx, ans)
            if normalize:
                s /= len(ans)  # per-token: avoids length bias
            scores.append(s)
        return int(torch.tensor(scores).argmax())

Zero-Shot vs Few-Shot
Benchmarks are run in different 'shot' settings. Zero-shot gives only the question. Few-shot (e.g. 5-shot) prepends a handful of solved examples to the prompt, letting the model infer the task format via in-context learning. Few-shot usually scores higher for base models, which is why papers must always state the shot count — a '5-shot MMLU' number is not comparable to a '0-shot' one.
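The difference between the two settings comes down to how the prompt string is assembled. The following is a minimal sketch, assuming a simple "Question / lettered options / Answer:" template; the function `build_prompt`, the template, and the toy questions are all hypothetical, and real evaluation harnesses each define their own format.

```python
def build_prompt(question, choices, examples=(), letters="ABCD"):
    """Format a k-shot multiple-choice prompt, where k = len(examples)."""
    def render(q, opts, answer=None):
        lines = [f"Question: {q}"]
        lines += [f"{letter}. {opt}" for letter, opt in zip(letters, opts)]
        # Solved examples show their answer; the target question leaves it open.
        lines.append("Answer:" if answer is None else f"Answer: {answer}")
        return "\n".join(lines)

    blocks = [render(q, opts, ans) for q, opts, ans in examples]
    blocks.append(render(question, choices))
    return "\n\n".join(blocks)

solved = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?",
     ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
target = ("What is the chemical symbol for water?", ["CO2", "H2O", "NaCl", "O2"])

zero_shot = build_prompt(*target)
two_shot = build_prompt(*target, examples=solved)
print(two_shot)
```

Both prompts end with an open "Answer:" line for the model to complete; the two-shot version simply places the solved examples in front of it.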
The gravest threat to benchmark validity is contamination: when the test questions (or close variants) appear in the training data. A contaminated model can score highly by memorization rather than capability, making the benchmark meaningless. Because pretraining data is scraped from the web — which contains benchmark questions, solutions, and discussions — contamination is pervasive and hard to eliminate.
How Contamination Happens
Pipeline Flow: The contamination pathway
| Step | Stage | What happens |
|---|---|---|
| 1 | Benchmark published | Test questions and answers posted online |
| 2 | Web absorbs it | Questions appear in articles, forums, GitHub, leaderboards |
| 3 | Crawl captures it | Common Crawl scrapes the pages containing test data |
| 4 | Training ingests it | Despite decontamination filters, some leaks through |
| 5 | Score inflated | Model recalls memorized answers, not true capability |
Chapter 17 covered decontamination — removing n-gram-matching benchmark text from training data — but it is imperfect. Paraphrased questions, reformatted tables, and translated versions evade n-gram filters. And the more a benchmark is discussed online, the more it leaks. This is a structural problem that the field continually fights and never fully wins.
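A toy version of an n-gram filter shows both what it catches and what slips past it. This is a sketch only: the function names, the 8-word window, and the example texts are assumptions made for illustration, not the settings of any particular pipeline.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """Flag a document that shares any word n-gram with a benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)

benchmark = [
    "a train leaves the station at noon traveling sixty miles per hour "
    "how far does it go in three hours"
]

# A web page that quotes the test question word for word.
verbatim = (
    "forum post: a train leaves the station at noon traveling sixty miles "
    "per hour how far does it go in three hours answer below"
)
# The same problem, reworded.
paraphrase = (
    "departing at midday a locomotive moves at sixty mph "
    "what distance is covered after three hours"
)

print(is_contaminated(verbatim, benchmark), is_contaminated(paraphrase, benchmark))
```

The verbatim copy is flagged, while the paraphrase shares no 8-word window with the benchmark item and passes the filter untouched.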
Beyond whether a model is RIGHT, we care whether it KNOWS when it is right — its calibration. A well-calibrated model's stated confidence matches its actual accuracy: of the predictions it makes with 80% confidence, about 80% should be correct. Calibration matters for trustworthiness, for knowing when to defer, and for downstream decisions that depend on confidence.
Measuring Calibration
Calibration is measured by binning predictions by confidence and comparing each bin's average confidence to its actual accuracy. The Expected Calibration Error (ECE) summarizes the gap. A reliability diagram plots confidence against accuracy: a perfectly-calibrated model lies on the diagonal.
Bin predictions by confidence into M bins.

ECE = Σₘ (|Bₘ| / N) · | accuracy(Bₘ) - confidence(Bₘ) |

Perfect calibration: accuracy(Bₘ) = confidence(Bₘ) for every bin → ECE = 0

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """confidences: model's confidence per prediction; correct: 0/1 array."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0, 1, n_bins + 1)
        ece, N = 0.0, len(confidences)
        for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
            # Bins are (lo, hi]; the first bin also includes confidence == 0
            lower_ok = (confidences >= lo) if i == 0 else (confidences > lo)
            mask = lower_ok & (confidences <= hi)
            if mask.sum() == 0:
                continue
            bin_acc = correct[mask].mean()       # actual accuracy in bin
            bin_conf = confidences[mask].mean()  # avg confidence in bin
            ece += (mask.sum() / N) * abs(bin_acc - bin_conf)
        return ece

    # ECE near 0 = well calibrated. A base model might have ECE ~0.02;
    # the same model after RLHF might have ECE ~0.15 (overconfident).

During training, evaluation must be automated and integrated into the loop. An evaluation harness runs a fixed suite of metrics at regular intervals, logs them, and surfaces them on a dashboard. The key design goals: cheap enough to run often, comprehensive enough to catch problems, and reproducible enough that numbers are comparable across checkpoints.
    import math

    import torch

    # Assumed to be defined elsewhere: lm_loss(model, batch) returning a scalar
    # loss tensor, tok (the tokenizer), log(), rank, and eval_every.

    def evaluate(model, val_loader, benchmarks, step):
        """Run intrinsic + extrinsic evaluation, return a metrics dict."""
        model.eval()
        metrics = {'step': step}

        # 1. Intrinsic: validation perplexity (cheap, run often)
        total_loss, n = 0.0, 0
        with torch.no_grad():
            for batch in val_loader:
                total_loss += lm_loss(model, batch.cuda()).item()
                n += 1
        # Averaging per-batch means is exact only if batches hold equal token counts
        metrics['val_ppl'] = math.exp(total_loss / n)

        # 2. Extrinsic: a few cheap benchmarks (run less often)
        for name, bench in benchmarks.items():
            correct = sum(
                answer_question(model, q.ctx, q.choices, tok) == q.label
                for q in bench.questions
            )
            metrics[name] = correct / len(bench.questions)

        model.train()
        return metrics

    # In the training loop:
    if step % eval_every == 0 and rank == 0:
        m = evaluate(model, val_loader, benchmarks, step)
        log(m)  # to wandb / tensorboard / file
        print(f"step {step}: val_ppl={m['val_ppl']:.2f} mmlu={m.get('mmlu', 0):.3f}")

What to Track and How Often
| Metric | Frequency | Cost |
|---|---|---|
| Training loss | Every step | Free (already computed) |
| Validation perplexity | Every ~100–1000 steps | Cheap (held-out forward passes) |
| Cheap benchmarks (HellaSwag) | Every few thousand steps | Moderate |
| Full benchmark suite (MMLU+) | At major checkpoints | Expensive |
| Generation samples | Periodically | Cheap, qualitative sanity check |
| Calibration (ECE) | At checkpoints | Moderate |
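One simple way to encode a cadence like the table above is a mapping from evaluation name to interval. This is a sketch under stated assumptions: the helper `evals_due`, the evaluation names, and the specific intervals are hypothetical and would be tuned to the cost of each evaluation in a real run.

```python
def evals_due(step, schedule):
    """Return the names of the evaluations scheduled to run at this step."""
    return [name for name, every in schedule.items() if step % every == 0]

# Hypothetical intervals, in optimizer steps.
SCHEDULE = {
    "val_ppl": 500,         # cheap: held-out forward passes
    "hellaswag": 5_000,     # moderate: one small benchmark
    "full_suite": 50_000,   # expensive: major checkpoints only
}

print(evals_due(500, SCHEDULE))     # ['val_ppl']
print(evals_due(5_000, SCHEDULE))   # ['val_ppl', 'hellaswag']
print(evals_due(50_000, SCHEDULE))  # ['val_ppl', 'hellaswag', 'full_suite']
print(evals_due(501, SCHEDULE))     # []
```

Choosing intervals that divide one another keeps the expensive evaluations aligned with the cheap ones, so every major checkpoint carries the full set of numbers.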
The practical decision that evaluation ultimately serves is when to stop. Unlike small-model training, where you train to convergence and early-stop on validation loss, frontier pretraining is governed by a compute budget and the scaling laws of Chapter 16. The decision balances diminishing returns against cost.
The Stopping Signals
| Signal | Interpretation |
|---|---|
| Compute budget exhausted | The most common reason — the plan said N tokens, you trained N tokens |
| Loss curve flattening | Diminishing returns; each additional token buys less |
| Benchmarks plateau | Capability gains have stalled even as loss inches down |
| Target metric reached | The model is good enough for its purpose |
| Validation loss rises | Overfitting (rare at scale, but possible with repeated data) |
| Inference-aware optimum | Trained past Chinchilla for cheaper serving (Ch. 16) |
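The "loss curve flattening" signal can be turned into a number by comparing the mean loss over the most recent evaluations with the mean over the window before it. The sketch below is one hypothetical way to do that; the function names, the window of 3, the 0.5% threshold, and the loss values are illustrative assumptions, not recommended settings.

```python
def relative_improvement(losses, window):
    """Fractional drop in mean loss between the last two windows."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    return (previous - recent) / previous

def has_plateaued(losses, window=3, min_gain=0.005):
    """True when the recent window improved by less than min_gain (0.5%)."""
    return relative_improvement(losses, window) < min_gain

# Invented validation-loss readings from consecutive evaluations.
still_improving = [2.60, 2.55, 2.50, 2.45, 2.40, 2.35]
flattening = [2.102, 2.101, 2.100, 2.100, 2.099, 2.099]

print(has_plateaued(still_improving), has_plateaued(flattening))
```

A check like this is a prompt for a human decision rather than an automatic stop: a flat loss can still hide benchmark gains, and the reverse is also possible.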
Recall from Chapter 16 that modern models are often trained FAR past the Chinchilla compute-optimal point (LLaMA-3 70B at ~214 tokens per parameter, roughly 15 trillion tokens, versus Chinchilla's ~20) because the cheaper inference of a smaller, longer-trained model more than repays the extra training. So 'when to stop' is increasingly an economic decision (training cost vs lifetime inference savings) rather than a pure convergence decision.
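The tokens-per-parameter comparison is simple arithmetic. The sketch below assumes roughly 15 trillion training tokens and 70 billion parameters, figures chosen because they reproduce the ~214 ratio quoted above; the function names and the constant are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio cited above

def tokens_per_param(n_tokens, n_params):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def overtraining_factor(n_tokens, n_params):
    """How many times past the ~20 tokens/param reference the run went."""
    return tokens_per_param(n_tokens, n_params) / CHINCHILLA_TOKENS_PER_PARAM

# Assumed figures: a 70e9-parameter model trained on 15e12 tokens.
ratio = tokens_per_param(15e12, 70e9)
factor = overtraining_factor(15e12, 70e9)
print(f"{ratio:.0f} tokens/param, {factor:.1f}x the reference ratio")
```

Under these assumptions the run goes about ten times past the compute-optimal ratio, which is the scale of overtraining the economic argument has to justify.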
The Annealing Phase
Many modern runs end with a deliberate annealing phase: the learning rate decays to near zero (the cosine tail of Chapter 15), and sometimes the data mixture shifts toward higher-quality sources for the final tokens. This final phase yields a disproportionate quality improvement, so the decision to stop is really a decision about when to BEGIN annealing — the run is planned around it.
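Planning a run around the annealing phase means fixing the step at which the decay begins. The sketch below is a deliberate simplification, assuming a constant peak learning rate followed by a cosine tail to a minimum; the function name, its parameters, and the step counts are hypothetical, and the schedule of Chapter 15 may differ in its details (for example warmup, or a cosine that spans the whole run).

```python
import math

def learning_rate(step, total_steps, peak_lr, anneal_start, min_lr=0.0):
    """Constant peak LR, then a cosine decay to min_lr from anneal_start on."""
    if not 0 <= anneal_start < total_steps:
        raise ValueError("anneal_start must lie in [0, total_steps)")
    if step < anneal_start:
        return peak_lr
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = min((step - anneal_start) / (total_steps - anneal_start), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Hypothetical plan: 1,000 steps in total, annealing over the final 100.
for step in (0, 900, 950, 1000):
    print(step, learning_rate(step, total_steps=1000, peak_lr=3e-4, anneal_start=900))
```

Because `anneal_start` is an explicit argument, moving the end of the run means re-planning where the decay begins, which is the sense in which stopping is really a decision about when annealing starts.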
A final, essential caution: benchmark scores are proxies for capability, not capability itself. The gap between a high leaderboard number and a genuinely useful model is real and important, and conflating the two leads to bad decisions.
Why the Gap Exists
Beyond Benchmarks
This is why frontier evaluation increasingly relies on human preference judgments (LMSYS Chatbot Arena's head-to-head battles), capability-specific red-teaming, and held-out or private test sets. These resist contamination and capture dimensions — helpfulness, coherence, judgment — that multiple-choice accuracy misses. The aligned-model evaluation of Part V leans heavily on these human-centered methods, because the qualities that matter most in a deployed model are precisely the ones benchmarks capture worst.
Evaluation Quick-Reference
| Metric/Concept | Measures | Key caveat |
|---|---|---|
| Perplexity | Prediction quality | Not comparable across tokenizers |
| MMLU | Broad knowledge | Saturating; contamination-prone |
| HellaSwag | Commonsense | Multiple-choice ≠ real use |
| Log-likelihood scoring | MC answer selection | Normalization changes results |
| Few-shot setting | In-context task ability | Shot count must be stated |
| Calibration (ECE) | Confidence vs accuracy | Alignment degrades it |
| Contamination | Test-in-train leakage | Inflates scores; hard to remove |
Exercises
Exercises 1–10 are pen-and-paper; 11–18 require code.
Further reading: “Measuring Massive Multitask Language Understanding” (Hendrycks et al., 2021, MMLU) and “HellaSwag” (Zellers et al., 2019). “Beyond the Imitation Game” (Srivastava et al., 2022, BIG-Bench). The EleutherAI lm-evaluation-harness for reproducible evaluation. “On Calibration of Modern Neural Networks” (Guo et al., 2017) and the GPT-4 technical report's calibration discussion. “Data Contamination” analyses (e.g. Sainz et al., 2023). The LMSYS Chatbot Arena for human-preference evaluation.
Part IV Complete: Pretraining LLMs
| Chapter | Title | Covers |
|---|---|---|
| Ch. 17 | Data Collection & Curation | Common Crawl, deduplication, quality and safety filtering, data mixing: the pipeline that determines what a model knows. |
| Ch. 18 | Distributed Training | Data, tensor, and pipeline parallelism, ZeRO and FSDP: spreading a run across thousands of GPUs. |
| Ch. 19 | Architecture Variants | RoPE, RMSNorm, SwiGLU, GQA, FlashAttention, sparse attention: the refinements that define modern models. |
| Ch. 20 | Efficient Training | The memory hierarchy, kernel fusion, Triton, recomputation, fp8, torch.compile: raising hardware utilization. |
| Ch. 21 | Evaluation During Pretraining | Perplexity, MMLU and friends, calibration, contamination: turning the run into an instrument you can steer. |
You now have a fully pretrained base model: trained on curated trillions of tokens, distributed across thousands of GPUs, built from a modern architecture, optimized for efficiency, and evaluated throughout. But a base model is not yet useful. It predicts the next token brilliantly, yet it does not reliably follow instructions, refuse harmful requests, or behave the way we want — it merely continues text. Part V — Alignment & Post-training — closes that gap: supervised fine-tuning to teach the model to follow instructions (Chapter 22), reinforcement learning from human feedback to align it with human preferences (Chapter 23), the simpler DPO family that reframes alignment as classification (Chapter 24), reasoning and chain-of-thought training (Chapter 25), and the constitutional and safety methods that make a model trustworthy (Chapter 26). The raw capability you have built becomes a helpful, harmless, honest assistant.