Part IV: Pretraining LLMs

Chapter 21

Evaluation During Pretraining

Perplexity, downstream benchmarks like MMLU and HellaSwag, calibration, contamination, and reading the signals of a months-long run to decide when to continue, adjust, or stop.

Learning Objectives

1.	Compute and interpret perplexity, and understand its uses and limits.
2.	Score multiple-choice benchmarks by log-likelihood and understand zero/few-shot evaluation.
3.	Know the major benchmarks (MMLU, HellaSwag, BIG-Bench) and what each measures.
4.	Detect and reason about benchmark contamination.
5.	Understand calibration and why it degrades with alignment.
6.	Build an evaluation harness and track metrics during a training run.
7.	Decide when to stop training using loss, benchmarks, and compute budget.
8.	Appreciate the gap between benchmark scores and real-world capability.

Why Evaluate During Training

A frontier pretraining run takes weeks or months and costs millions of dollars, so launching it blind and hoping for the best is not an option. Evaluation during pretraining turns the run from a black box into an instrument panel: it tells you whether the model is learning, whether something has gone wrong, and when there is no more to gain. This chapter, which closes Part IV, covers what to measure and how to read it.

Three Questions Evaluation Answers

•Is it working? The loss should fall smoothly; a stall or spike signals a problem to diagnose (Chapter 15).

•Is it improving in ways that matter? Falling loss should translate into rising capability on downstream tasks.

•When should we stop? At some point more compute yields diminishing returns, or the budget runs out, or the model is good enough.

These map onto two complementary kinds of metric. INTRINSIC metrics — loss and perplexity — measure how well the model predicts text, are cheap to compute continuously, and track the training objective directly. EXTRINSIC metrics — downstream benchmarks — measure actual capabilities, are more expensive, and are run periodically. A good evaluation strategy uses both.

Intrinsic metrics	Extrinsic metrics
Loss, perplexity	MMLU, HellaSwag, etc.
Measure prediction quality	Measure task capability
Cheap — run every step	Expensive — run periodically
Directly track the objective	Track what we actually care about
Smooth, predictable	Noisier, can jump
Can't be gamed by the model	Vulnerable to contamination

Perplexity

Perplexity is the exponential of the average per-token cross-entropy loss (Chapter 4). It is the primary intrinsic metric for language models: the standard, interpretable way to express how well a model predicts held-out text. Lower is better.

text•Perplexity
PPL = exp(L) = exp( -(1/T) Σₜ log P(xₜ | x_<ₜ) )

L = average cross-entropy loss in nats
PPL = the effective branching factor: 'how many equally-likely
       next tokens is the model choosing among?'

A perplexity of 1 means perfect prediction (the model always assigns probability 1 to the correct token). A perplexity equal to the vocabulary size corresponds to uniform random guessing over the vocabulary. A well-trained LLM reaches single-digit to low-double-digit perplexity on typical English text: it is effectively choosing among only a handful of plausible next tokens at each step.
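The two endpoints and the "branching factor" reading can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `perplexity_from_probs` and the vocabulary size are illustrative assumptions rather than anything defined in this chapter.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity from the probability the model gave each correct token."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)  # mean loss in nats
    return math.exp(avg_nll)

V = 50_000  # hypothetical vocabulary size

perfect = perplexity_from_probs([1.0] * 8)          # always certain and correct
uniform = perplexity_from_probs([1.0 / V] * 8)      # uniform guess over the vocabulary
one_in_eight = perplexity_from_probs([0.125] * 8)   # like choosing among 8 options

print(perfect, round(uniform), round(one_in_eight, 6))
```

The same conversion works in reverse for a reported loss: a loss of 2.0 nats corresponds to a perplexity of exp(2.0), about 7.39.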

Python•Computing perplexity
import math

import torch
import torch.nn.functional as F

def perplexity(model, tokens):  # tokens: (B, T+1)
    """Perplexity on a batch of token sequences."""
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    with torch.no_grad():
        logits = model(inputs)                 # (B, T, V)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
    return math.exp(loss.item())

# Always report which dataset and tokenizer the perplexity is on:
# PPL is only comparable across models using the SAME tokenizer and
# the SAME evaluation text. A model with a bigger vocabulary will have
# a different PPL on the same text -- it is NOT comparable across tokenizers.

⚠️

Pitfall: Perplexity Is Not Comparable Across Tokenizers

A model's perplexity depends on its tokenizer. A model with a 128k vocabulary splits text into fewer, larger tokens than a 32k-vocab model, so the per-token perplexities are not directly comparable — they are predicting different units. Comparing the raw perplexity of two models with different tokenizers is meaningless.

To compare across tokenizers, normalize to bits-per-byte or bits-per-character: the total loss, converted to bits, divided by the number of bytes or characters of underlying text rather than the number of tokens. This is why papers often report bits-per-byte for cross-model comparison and reserve token perplexity for tracking a single model over training.
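A small sketch of the conversion may help. It assumes you have the summed (not averaged) negative log-likelihood in nats over an evaluation text; the helper names and the toy numbers are hypothetical. The example gives two imaginary tokenizers the same total loss on the same text, which yields different token perplexities but a single bits-per-byte value.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Summed token-level loss (nats) converted to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def token_perplexity(total_nll_nats, n_tokens):
    """Per-token perplexity from the same summed loss."""
    return math.exp(total_nll_nats / n_tokens)

text = "The quick brown fox jumps over the lazy dog."
total_nll = 30.0  # hypothetical summed loss, identical for both tokenizers

ppl_coarse = token_perplexity(total_nll, n_tokens=10)  # large vocab, fewer tokens
ppl_fine = token_perplexity(total_nll, n_tokens=15)    # small vocab, more tokens
bpb = bits_per_byte(total_nll, text)                   # independent of token count

print(ppl_coarse, ppl_fine, bpb)
```

Because the denominator is bytes of text, the tokenizer drops out of the comparison entirely.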

The Limits of Perplexity

Perplexity is essential but insufficient. It measures how well the model predicts text on average, yet a model can predict text well while being poor at the tasks we actually care about. Perplexity and capability are correlated, but only loosely, and the gaps matter.

What Perplexity Misses

•Task ability: predicting the next token of a math problem well does not mean solving it correctly.

•Reasoning: perplexity rewards local fluency, not multi-step logical coherence.

•Factuality: a confidently-stated falsehood can have low perplexity if it is fluent.

•Instruction following: a base model with great perplexity may ignore instructions entirely (that needs alignment, Part V).

•Rare but critical capabilities: averaged over text, a rare crucial skill barely moves perplexity.

✧

Eval Note: The Emergence Connection

Recall the emergent-abilities debate (Chapter 16): perplexity improves smoothly, but some downstream capabilities appear to jump at certain scales. This is precisely why perplexity alone is insufficient — a smoothly-falling loss can hide a capability that is about to switch on, or one that has stalled despite the loss still dropping.

Perplexity tells you the model is learning to predict text better. It does not tell you WHICH capabilities are improving. For that you need downstream benchmarks — the subject of the rest of this chapter.

▶

ML Connection: Perplexity Predicts, Benchmarks Confirm

In practice, perplexity and benchmarks play complementary roles. Perplexity (and the scaling laws of Chapter 16) PREDICTS that a model will be capable — it is the cheap, continuous signal that the run is on track. Benchmarks CONFIRM which specific capabilities have actually materialized — the expensive, periodic check that the predicted capability is real.

A healthy run shows perplexity falling continuously and benchmark scores rising in periodic checkpoints. A divergence between them — falling loss but flat benchmarks — is a signal worth investigating.
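One way to make "falling loss but flat benchmarks" checkable is a simple rule over the logged history. The sketch below is an assumption-laden illustration: the function name, the window, and both thresholds are hypothetical and would need tuning against the noise level of a real benchmark.

```python
def loss_benchmark_divergence(val_losses, bench_scores, window=3,
                              min_loss_drop=0.01, min_gain=0.005):
    """True if loss kept falling over the last `window` evals while the
    benchmark score stayed essentially flat."""
    if len(val_losses) <= window or len(bench_scores) <= window:
        return False  # not enough history yet
    loss_drop = val_losses[-window - 1] - val_losses[-1]
    bench_gain = bench_scores[-1] - bench_scores[-window - 1]
    return loss_drop >= min_loss_drop and bench_gain < min_gain

losses = [2.50, 2.40, 2.32, 2.26, 2.21]
healthy = [0.41, 0.45, 0.48, 0.50, 0.53]      # accuracy rising with the loss
stalled = [0.41, 0.45, 0.452, 0.451, 0.453]   # accuracy flat despite falling loss

print(loss_benchmark_divergence(losses, healthy))
print(loss_benchmark_divergence(losses, stalled))
```

A flag from a rule like this is a prompt to investigate, not a verdict; benchmark accuracy is noisy enough that a short flat stretch can be chance.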

Downstream Benchmarks

Downstream benchmarks measure specific capabilities by posing tasks and scoring the model's answers. They are how the field tracks and compares model capability. There are hundreds; here are the most influential and what each probes.

Benchmark	Measures	Format
MMLU	Broad knowledge (57 subjects)	4-way multiple choice
HellaSwag	Commonsense sentence completion	4-way multiple choice
ARC	Grade-school science reasoning	Multiple choice
WinoGrande	Coreference / commonsense	Binary choice
GSM8K	Grade-school math word problems	Free-form numeric answer
HumanEval	Python code generation	Functional unit tests
TruthfulQA	Resistance to false beliefs	Multiple choice / generation
BIG-Bench	200+ diverse tasks	Mixed formats

MMLU: The Knowledge Standard

MMLU (Massive Multitask Language Understanding; Hendrycks et al., 2021) is the most-cited knowledge benchmark: 57 subjects ranging from elementary mathematics to professional law, with every question posed as four-way multiple choice. It became the standard headline number for comparing models' breadth of knowledge. Random guessing scores 25%; strong models exceed 80%.

HellaSwag and Commonsense

HellaSwag (Zellers et al., 2019) tests commonsense by asking the model to choose the most plausible continuation of a scenario from four options, three of which are adversarially-generated distractors that are grammatical but nonsensical. It probes whether the model has a coherent world model, not just fluency.

✧

Eval Note: Benchmarks Saturate

A benchmark's useful life is limited: once models reliably exceed ~90%, it can no longer distinguish them, and the field moves on. MMLU, once a frontier challenge, is now nearly saturated by top models, prompting harder successors (MMLU-Pro, GPQA). HellaSwag and others have followed the same arc.

This benchmark treadmill is a permanent feature of the field. As models improve, benchmarks must get harder to remain informative — which is why the set of 'standard' benchmarks keeps changing and why a high score on a saturated benchmark says little.

How Multiple-Choice Benchmarks Are Scored

How exactly do you score a base model on a multiple-choice question? The model does not 'pick' an answer; it assigns probabilities to token sequences. The standard method is log-likelihood scoring: present the question with each candidate answer, measure the probability the model assigns to that answer given the question, and choose the candidate with the highest probability. The details of this scoring matter more than one might expect.

Log-Likelihood Scoring

text•Multiple-choice scoring by log-likelihood (Pseudocode)
# For a question Q with candidate answers A1 ... Ak:
for each candidate A_i:
    score_i = sum of log P(token | previous)  over A_i's tokens
    given the context Q

prediction = argmax_i score_i
# Variants normalize by answer length or by answer's prior probability

A crucial subtlety: longer answers have lower total log-likelihood simply because they have more tokens, each contributing negative log-probability. Different normalization schemes — raw sum, per-token average, or normalization by the answer's unconditional probability — can change which answer wins, and therefore the benchmark score. This is why the same model can score differently in different evaluation harnesses.
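The effect is easy to see with made-up numbers. In the sketch below each candidate is described by its per-token log-probabilities given the question and by a hypothetical unconditional log-probability of the answer on its own; the values are invented so that the three normalization schemes each select a different answer.

```python
def pick(choices, mode):
    """choices: list of (per-token log-probs given context, unconditional log-prob).
    Returns the index of the winning choice under the given normalization."""
    def score(token_logps, uncond_logp):
        total = sum(token_logps)
        if mode == "sum":
            return total                      # raw log-likelihood
        if mode == "per_token":
            return total / len(token_logps)   # length-normalized
        if mode == "unconditional":
            return total - uncond_logp        # relative to the answer's prior
        raise ValueError(mode)
    scores = [score(lp, u) for lp, u in choices]
    return max(range(len(scores)), key=scores.__getitem__)

choices = [
    ([-2.0], -2.5),                  # A: short answer, best raw sum
    ([-0.8, -0.8, -0.8, -0.8], -3.5),  # B: long answer, best per-token average
    ([-1.5, -1.5], -6.0),            # C: rare answer, best gain over its prior
]

winners = {mode: pick(choices, mode) for mode in ("sum", "per_token", "unconditional")}
print(winners)
```

None of the three is the one correct rule; the point is only that the choice has to be fixed and reported alongside the score.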

Python•Log-likelihood multiple-choice scoring
import torch
import torch.nn.functional as F

def score_choice(model, context_ids, answer_ids):
    """Sum of log P(answer | context), token by token.

    context_ids and answer_ids are 1-D LongTensors; context_ids must be non-empty.
    """
    ids = torch.cat([context_ids, answer_ids]).unsqueeze(0)  # (1, ctx+ans)
    with torch.no_grad():
        logits = model(ids[:, :-1])        # row j predicts the token at j+1
        logp = F.log_softmax(logits[0], dim=-1)
    # Sum log-prob over the ANSWER tokens only. The first answer token sits at
    # position len(context_ids), so it is predicted by row len(context_ids) - 1.
    ans_start = len(context_ids) - 1
    total = 0.0
    for i, tok in enumerate(answer_ids):
        total += logp[ans_start + i, tok].item()
    return total

def answer_question(model, context, choices, tokenizer, normalize=True):
    ctx = tokenizer(context)
    scores = []
    for choice in choices:
        ans = tokenizer(choice)
        s = score_choice(model, ctx, ans)
        if normalize: s /= len(ans)       # per-token: avoids length bias
        scores.append(s)
    return int(torch.tensor(scores).argmax())

Zero-Shot vs Few-Shot

Benchmarks are run in different 'shot' settings. Zero-shot gives only the question. Few-shot (e.g. 5-shot) prepends a handful of solved examples to the prompt, letting the model infer the task format via in-context learning. Few-shot usually scores higher for base models, which is why papers must always state the shot count — a '5-shot MMLU' number is not comparable to a '0-shot' one.
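Constructing the two settings differs only in what is prepended. The sketch below uses a hypothetical "Question / A. B. C. D. / Answer:" template; real harnesses each define their own, and the template itself affects scores.

```python
def format_example(question, choices, answer=None):
    """Render one item; leave the answer blank for the question being asked."""
    letters = "ABCD"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)

def build_prompt(shots, question, choices, k=0):
    """Prepend the first k solved examples; k=0 gives a zero-shot prompt."""
    parts = [format_example(q, c, a) for q, c, a in shots[:k]]
    parts.append(format_example(question, choices))
    return "\n\n".join(parts)

shots = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], "A"),
]
target = ("Largest planet?", ["Mars", "Venus", "Jupiter", "Mercury"])

zero_shot = build_prompt(shots, *target, k=0)
two_shot = build_prompt(shots, *target, k=2)
print(two_shot)
```

The solved examples should come from a development split, never from the test items being scored.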

⚠️

Pitfall: Evaluation Settings Make Numbers Incomparable

The same model can post very different benchmark scores depending on the harness, the shot count, the answer normalization, the exact prompt template, and whether answers are scored by log-likelihood or by generating a letter. Reported numbers from different sources are frequently not comparable for these reasons.

This is why reproducible evaluation harnesses (like EleutherAI's lm-evaluation-harness) matter: they fix these choices so scores are comparable. When you see a benchmark number, ask which harness, how many shots, and what normalization — without those, the number is nearly meaningless.
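A lightweight habit that follows from this is to store the settings next to every number and refuse to compare results whose settings differ. The record below is a hypothetical sketch, not the configuration format of any particular harness; the field names and example values are assumptions.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class EvalSettings:
    harness: str          # name and version of the evaluation harness
    n_shots: int
    normalization: str    # e.g. "sum", "per_token", "unconditional"
    prompt_template: str
    scoring: str          # e.g. "loglikelihood" or "generate_letter"

def differing_fields(a, b):
    """Names of the settings that differ between two reported results."""
    da, db = asdict(a), asdict(b)
    return [k for k in da if da[k] != db[k]]

ours = EvalSettings("harness-x 1.0", 5, "per_token",
                    "Question: {q}\nAnswer:", "loglikelihood")
theirs = EvalSettings("harness-x 1.0", 0, "sum",
                      "Question: {q}\nAnswer:", "loglikelihood")

mismatch = differing_fields(ours, theirs)
print(mismatch)  # any non-empty list means the two scores are not comparable
```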

Benchmark Contamination

The gravest threat to benchmark validity is contamination: test questions (or close variants) appearing in the training data. A contaminated model can score highly by memorization rather than capability, making the benchmark meaningless. Because pretraining data is scraped from the web, which contains benchmark questions, solutions, and discussions, contamination is pervasive and hard to eliminate.

How Contamination Happens

Pipeline Flow: The contamination pathway

1	Benchmark published	Test questions and answers posted online
2	Web absorbs it	Questions appear in articles, forums, GitHub, leaderboards
3	Crawl captures it	Common Crawl scrapes the pages containing test data
4	Training ingests it	Despite decontamination filters, some leaks through
5	Score inflated	Model recalls memorized answers, not true capability

Chapter 17 covered decontamination — removing n-gram-matching benchmark text from training data — but it is imperfect. Paraphrased questions, reformatted tables, and translated versions evade n-gram filters. And the more a benchmark is discussed online, the more it leaks. This is a structural problem that the field continually fights and never fully wins.
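The weakness of exact n-gram matching shows up even in a toy version. The sketch below uses whitespace-split word 8-grams, which is a simplifying assumption (production filters differ in tokenization, n, and thresholds); the example strings are invented.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlaps(train_doc, test_item, n=8):
    """True if the training document shares any word n-gram with the test item."""
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

test_q = ("What is the capital city of the country directly north of Spain "
          "and west of Germany?")
verbatim = "Quiz answer key. " + test_q + " Paris."
paraphrase = ("Which city serves as capital for the nation lying just north of "
              "Spain and to the west of Germany?")

caught = overlaps(verbatim, test_q)       # exact copy shares many 8-grams
missed = overlaps(paraphrase, test_q)     # same question, no shared 8-gram
print(caught, missed)
```

The paraphrase leaks exactly the same information as the verbatim copy, yet its longest run of shared words is far shorter than the filter's window.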

⚠️

Treat High Benchmark Scores with Suspicion

When a model posts a surprisingly high score on a popular benchmark, contamination is a leading hypothesis. Several published models have been found, after the fact, to have trained on benchmark data — sometimes inadvertently. A score that seems too good often is.

Defenses include: evaluating on benchmarks released AFTER the training-data cutoff (so they cannot have leaked), using private held-out test sets, perturbing questions to defeat memorization, and reporting the train/test overlap explicitly. Healthy skepticism toward leaderboard numbers is a professional necessity.
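As one concrete example of perturbation, the answer options of a multiple-choice item can be reordered so that a memorized letter or position no longer helps. This is a sketch of a single possible perturbation under that assumption, with a hypothetical helper name; it does nothing against a model that memorized the answer text itself.

```python
import random

def shuffle_choices(choices, label, seed):
    """Reorder the options deterministically; return (new choices, new label index)."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(label)

choices = ["Mars", "Venus", "Jupiter", "Mercury"]
label = 2  # "Jupiter"

for seed in range(3):
    new_choices, new_label = shuffle_choices(choices, label, seed)
    print(new_choices, new_label)
```

A large accuracy drop between the original and shuffled orderings is a hint, though not proof, that the original score leaned on memorized positions.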

✧

Eval Note: The Test-After-Cutoff Trick

A clean way to measure true capability is to evaluate on data that did not exist when the model was trained — a benchmark released after the training cutoff, or freshly-written questions. If a model scores well on genuinely novel problems, the capability is real, not memorized.

This is why dynamic and frequently-refreshed benchmarks (and private evaluation sets) are increasingly preferred over static public ones. A static benchmark's validity decays the moment it is published and starts leaking onto the web.

Calibration

Beyond whether a model is RIGHT, we care whether it KNOWS when it is right, which is its calibration. A well-calibrated model's stated confidence matches its actual accuracy: of the predictions it makes with 80% confidence, about 80% should be correct. Calibration matters for trustworthiness, for knowing when to defer, and for downstream decisions that depend on confidence.

Measuring Calibration

Calibration is measured by binning predictions by confidence and comparing each bin's average confidence to its actual accuracy. The Expected Calibration Error (ECE) summarizes the gap. A reliability diagram plots confidence against accuracy: a perfectly-calibrated model lies on the diagonal.

text•Expected Calibration Error
Bin predictions by confidence into M bins.

ECE = Σₘ (|Bₘ| / N) · | accuracy(Bₘ) - confidence(Bₘ) |

Perfect calibration: accuracy(Bₘ) = confidence(Bₘ) for every bin → ECE = 0

✧

Eval Note: Pretrained Models Are Well-Calibrated — Alignment Breaks It

A striking finding (OpenAI's GPT-4 report and others): BASE pretrained models are often remarkably well-calibrated — their token probabilities genuinely reflect their uncertainty. But ALIGNMENT (RLHF, Part V) tends to DEGRADE calibration: the aligned model becomes overconfident, stating answers with high confidence regardless of correctness.

This is a real cost of alignment worth understanding now: the very process that makes a model helpful and harmless can make it worse at expressing genuine uncertainty. Recovering calibration after alignment is an active research problem, and it is one reason base-model evaluation (this chapter) and aligned-model evaluation (Part V) tell different stories.

Python•Computing Expected Calibration Error
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: model's confidence per prediction; correct: 0/1 array."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0, 1, n_bins + 1)
    ece, N = 0.0, len(confidences)
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        # Bins are (lo, hi]; the first bin also includes confidence == 0.
        lower = confidences >= lo if i == 0 else confidences > lo
        mask = lower & (confidences <= hi)
        if mask.sum() == 0:
            continue
        bin_acc  = correct[mask].mean()        # actual accuracy in bin
        bin_conf = confidences[mask].mean()    # avg confidence in bin
        ece += (mask.sum() / N) * abs(bin_acc - bin_conf)
    return ece

# ECE near 0 = well calibrated. A base model might have ECE ~0.02;
# the same model after RLHF might have ECE ~0.15 (overconfident).
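Two pieces are needed around an ECE computation in practice: a confidence for each prediction, and the per-bin numbers behind a reliability diagram. The sketch below assumes that confidence on a multiple-choice item is taken as the softmax over the per-choice log-likelihood scores, which is one common convention rather than the only one; the helper names are hypothetical and the code is plain Python so it runs without NumPy.

```python
import math

def choice_confidence(scores):
    """Softmax over per-choice scores -> (predicted index, confidence)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for stability
    z = sum(exps)
    probs = [e / z for e in exps]
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred]

def reliability_rows(confidences, correct, n_bins=5):
    """Per-bin (lo, hi, count, mean confidence, accuracy); empty bins are skipped."""
    rows = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if (c > lo or (b == 0 and c == lo)) and c <= hi]
        if not idx:
            continue
        conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        rows.append((lo, hi, len(idx), conf, acc))
    return rows

pred, conf = choice_confidence([-1.0, -3.0, -3.0, -3.0])
rows = reliability_rows([0.95, 0.90, 0.92, 0.55], [1, 0, 1, 1])
for row in rows:
    print(row)
```

Plotting mean confidence against accuracy for each row gives the reliability diagram; rows where accuracy falls below confidence lie under the diagonal and indicate overconfidence.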

Building an Evaluation Harness

During training, evaluation must be automated and integrated into the loop. An evaluation harness runs a fixed suite of metrics at regular intervals, logs them, and surfaces them on a dashboard. The key design goals are to be cheap enough to run often, comprehensive enough to catch problems, and reproducible enough that numbers are comparable across checkpoints.

Python•Code Lab: an evaluation harness in the training loop
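import math
import torch

# Assumed to be defined elsewhere in the training script: lm_loss(model, batch)
# returns the mean next-token cross-entropy for a batch, tok is the tokenizer
# passed to answer_question, and log, eval_every, and rank come from the loop.

def evaluate(model, val_loader, benchmarks, step):
    """Run intrinsic + extrinsic evaluation, return a metrics dict."""
    model.eval()
    metrics = {'step': step}

    # 1. Intrinsic: validation perplexity (cheap, run often).
    # Averaging per-batch means is exact only when batches have equal token counts.
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for batch in val_loader:
            total_loss += lm_loss(model, batch.cuda()).item()
            n += 1
    metrics['val_ppl'] = math.exp(total_loss / n)

    # 2. Extrinsic: a few cheap benchmarks (run less often; pass an empty dict
    # on evaluations where only perplexity is wanted)
    for name, bench in benchmarks.items():
        correct = sum(answer_question(model, q.ctx, q.choices, tok) == q.label
                      for q in bench.questions)
        metrics[name] = correct / len(bench.questions)

    model.train()
    return metrics

# In the training loop. Evaluating on rank 0 alone assumes every rank holds a
# full model replica; a sharded model needs all ranks in the forward pass.
if step % eval_every == 0 and rank == 0:
    m = evaluate(model, val_loader, benchmarks, step)
    log(m)                              # to wandb / tensorboard / file
    print(f"step {step}: val_ppl={m['val_ppl']:.2f} mmlu={m.get('mmlu',0):.3f}")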

What to Track and How Often

Metric	Frequency	Cost
Training loss	Every step	Free (already computed)
Validation perplexity	Every ~100–1000 steps	Cheap (held-out forward passes)
Cheap benchmarks (HellaSwag)	Every few thousand steps	Moderate
Full benchmark suite (MMLU+)	At major checkpoints	Expensive
Generation samples	Periodically	Cheap, qualitative sanity check
Calibration (ECE)	At checkpoints	Moderate
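The table translates naturally into a schedule lookup that the training loop consults each step. The intervals below are hypothetical placeholders chosen to match the table's ordering, not recommended values.

```python
EVAL_SCHEDULE = {          # hypothetical intervals, in optimizer steps
    "val_ppl": 500,        # validation perplexity
    "hellaswag": 5_000,    # a cheap benchmark
    "samples": 5_000,      # generation samples
    "full_suite": 50_000,  # full benchmark suite at major checkpoints
    "ece": 50_000,         # calibration
}

def due_evals(step, schedule=EVAL_SCHEDULE):
    """Names of the evaluations that should run at this step."""
    return [name for name, every in schedule.items()
            if step > 0 and step % every == 0]

for step in (500, 5_000, 50_000):
    print(step, due_evals(step))
```

Making the expensive intervals exact multiples of the cheap ones means every full-suite checkpoint also has a perplexity reading to line up against.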

✧

Eval Note: Don't Forget to Read the Samples

Amid all the quantitative metrics, periodically reading the model's actual generated text is invaluable. Numbers can hide qualitative problems — repetition loops, degenerate outputs, a subtle formatting bug — that are obvious the moment you read a sample. Generation samples are the cheapest, most underrated diagnostic.

Many experienced practitioners keep a fixed set of prompts and generate from them at every checkpoint, watching the outputs improve (or break) over the course of training. It is the qualitative complement to the quantitative dashboard.
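Reading samples stays a human job, but a crude automatic flag can point the reader at the checkpoints worth opening first. The sketch below assumes a hypothetical `generate` callable that maps a prompt to generated text, and uses the fraction of repeated word trigrams as a rough repetition-loop detector; the metric and the threshold are illustrative choices, not an established standard.

```python
def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate another n-gram in the sample."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def sample_report(generate, prompts, step, threshold=0.5):
    """Generate from a fixed prompt set and flag likely repetition loops."""
    report = []
    for prompt in prompts:
        out = generate(prompt)
        frac = repeated_ngram_fraction(out)
        report.append({"step": step, "prompt": prompt, "sample": out,
                       "repetition": frac, "flag": frac > threshold})
    return report

healthy = repeated_ngram_fraction(
    "The river runs past the old mill and into the valley below")
looping = repeated_ngram_fraction("the cat the cat the cat the cat the cat")
print(healthy, looping)
```

Keeping the prompt list fixed across the whole run is what makes the samples comparable from one checkpoint to the next.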

When to Stop Training

The practical decision that evaluation ultimately serves is when to stop. Unlike small-model training, where you train to convergence and early-stop on validation loss, frontier pretraining is governed by a compute budget and the scaling laws of Chapter 16. The decision balances diminishing returns against cost.

The Stopping Signals

Signal	Interpretation
Compute budget exhausted	The most common reason — the plan said N tokens, you trained N tokens
Loss curve flattening	Diminishing returns; each additional token buys less
Benchmarks plateau	Capability gains have stalled even as loss inches down
Target metric reached	The model is good enough for its purpose
Validation loss rises	Overfitting (rare at scale, but possible with repeated data)
Inference-aware optimum	Trained past Chinchilla for cheaper serving (Ch. 16)

Recall from Chapter 16 that modern models are often trained FAR past the Chinchilla compute-optimal point (Llama 3 70B at ~214 tokens per parameter versus Chinchilla's ~20) because the cheaper inference of a smaller, longer-trained model can more than repay the extra training. So 'when to stop' is increasingly an economic decision (training cost vs lifetime inference savings) rather than a pure convergence decision.

The Annealing Phase

Many modern runs end with a deliberate annealing phase: the learning rate decays to near zero (the cosine tail of Chapter 15), and sometimes the data mixture shifts toward higher-quality sources for the final tokens. This final phase yields a disproportionate quality improvement, so the decision to stop is really a decision about when to BEGIN annealing — the run is planned around it.
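Because the decay is tied to the total step count, the run length has to be fixed before the schedule can be written down. The sketch below shows a cosine schedule with linear warmup that reaches its minimum exactly at the planned budget; the peak learning rate, warmup length, and batch size in tokens are hypothetical values.

```python
import math

def total_steps_for(token_budget, tokens_per_step):
    """Number of optimizer steps needed to consume the token budget."""
    return math.ceil(token_budget / tokens_per_step)

def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = total_steps_for(token_budget=1e12, tokens_per_step=4e6)  # 250,000 steps
peak, warmup = 3e-4, 2_000

lr_start = cosine_lr(warmup, total, peak, warmup)        # peak, right after warmup
lr_mid = cosine_lr((warmup + total) // 2, total, peak, warmup)
lr_end = cosine_lr(total, total, peak, warmup)           # minimum at the budget
print(lr_start, lr_mid, lr_end)
```

Extending a run past `total_steps` under this schedule would train at the minimum learning rate, which is why changing the token budget mid-run means re-planning the schedule rather than simply continuing.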

✧

Eval Note: Stopping Is Planned, Not Discovered

At frontier scale, you do not train until you happen to notice diminishing returns — the entire run, including its length and annealing schedule, is PLANNED in advance using scaling-law forecasts (Chapter 16). You predict the final loss, decide the token budget, set the cosine schedule to reach zero at that budget, and execute. Evaluation during training confirms the plan is on track and catches problems, rather than determining the stopping point on the fly.

This is the synthesis of Part IV: curated data (Ch. 17), distributed across GPUs (Ch. 18), with the right architecture (Ch. 19), trained efficiently (Ch. 20), and monitored to confirm the scaling-law plan (this chapter). Pretraining is a planned, instrumented, industrial process — not a hopeful experiment.

Benchmarks Versus Real Capability

A final, essential caution: benchmark scores are proxies for capability, not capability itself. The gap between a high leaderboard number and a genuinely useful model is real and important, and conflating the two leads to bad decisions.

Why the Gap Exists

•Benchmarks are narrow: MMLU multiple-choice knowledge is a thin slice of what 'capable' means.

•Format mismatch: real use is open-ended generation and dialogue, not multiple-choice.

•Contamination inflates: scores may reflect memorization, not the targeted skill (Section 21.6).

•Goodhart's law: when a benchmark becomes a target, optimizing for it stops measuring true capability.

•Missing dimensions: helpfulness, safety, calibration, and judgment are poorly captured by accuracy.

⚠️

Goodhart's Law in LLM Evaluation

'When a measure becomes a target, it ceases to be a good measure.' Once a benchmark becomes the headline number labs compete on, the incentive to optimize for it directly — through data selection, targeted fine-tuning, or worse, contamination — corrupts its validity. A benchmark is most informative BEFORE everyone is optimizing for it.

This is why the field continually needs fresh benchmarks, human evaluation, and held-out tests. No static benchmark survives contact with sustained optimization pressure. Treat any single number with the skepticism it deserves.

Beyond Benchmarks

This is why frontier evaluation increasingly relies on human preference judgments (LMSYS Chatbot Arena's head-to-head battles), capability-specific red-teaming, and held-out or private test sets. These resist contamination and capture dimensions — helpfulness, coherence, judgment — that multiple-choice accuracy misses. The aligned-model evaluation of Part V leans heavily on these human-centered methods, because the qualities that matter most in a deployed model are precisely the ones benchmarks capture worst.

Chapter Summary & Exercises

Evaluation Quick-Reference

Metric/Concept	Measures	Key caveat
Perplexity	Prediction quality	Not comparable across tokenizers
MMLU	Broad knowledge	Saturating; contamination-prone
HellaSwag	Commonsense	Multiple-choice ≠ real use
Log-likelihood scoring	MC answer selection	Normalization changes results
Few-shot setting	In-context task ability	Shot count must be stated
Calibration (ECE)	Confidence vs accuracy	Alignment degrades it
Contamination	Test-in-train leakage	Inflates scores; hard to remove

Exercises

Exercises 1–10 are pen-and-paper; 11–18 require code.

✎

Exercise 1: Pen & Paper

Define perplexity in terms of cross-entropy. A model has loss 2.3 nats on held-out text — compute its perplexity and interpret the number.

✎

Exercise 2: Pen & Paper

Explain why perplexity is not comparable across models with different tokenizers. What normalization makes cross-model comparison valid?

✎

Exercise 3: Pen & Paper

List four capabilities that low perplexity does NOT guarantee. For each, explain why prediction quality and that capability can diverge.

✎

Exercise 4: Pen & Paper

Describe log-likelihood multiple-choice scoring. Why do longer answers get lower raw scores, and how does per-token normalization fix this?

✎

Exercise 5: Pen & Paper

Explain the difference between zero-shot and 5-shot evaluation. Why must a benchmark number always state the shot count to be meaningful?

✎

Exercise 6: Pen & Paper

Trace the contamination pathway from a published benchmark to an inflated score. Why does n-gram decontamination fail to catch paraphrased questions?

✎

Exercise 7: Pen & Paper

Define Expected Calibration Error. Sketch a reliability diagram for (a) a well-calibrated and (b) an overconfident model.

✎

Exercise 8: Pen & Paper

Why are base models often well-calibrated while RLHF degrades calibration? What does this imply about evaluating base vs aligned models?

✎

Exercise 9: Pen & Paper

Explain Goodhart's law in the context of LLM benchmarks. Give a concrete example of how optimizing for MMLU could corrupt its validity.

✎

Exercise 10: Pen & Paper

Modern models train far past the Chinchilla-optimal token count. Frame 'when to stop' as an economic decision balancing training cost against inference savings.

✎

Exercise 11: Code

Implement perplexity computation. Evaluate a small model on a held-out corpus and report both token perplexity and bits-per-byte.

✎

Exercise 12: Code

Implement log-likelihood multiple-choice scoring with and without per-token normalization. Run both on a small HellaSwag subset and compare the accuracies.

✎

Exercise 13: Code

Implement zero-shot and 5-shot evaluation on a multiple-choice task. Measure the accuracy difference and discuss why few-shot helps a base model.

✎

Exercise 14: Code Lab

Build an evaluation harness that computes validation perplexity plus accuracy on a small benchmark, and integrate it into a training loop with periodic logging.

✎

Exercise 15: Code

Implement Expected Calibration Error and plot a reliability diagram. Compute ECE for a model's multiple-choice predictions and interpret the result.

✎

Exercise 16: Code

Simulate contamination: train a tiny model with and without the test questions in its training data, and show the contaminated model's inflated score.

✎

Exercise 17: Code

Track metrics over training: train a small model, log perplexity and a benchmark every K steps, and plot both curves. Identify where benchmark gains plateau relative to loss.

✎

Exercise 18: Code (Challenge)

Build a mini reproducible eval suite (perplexity + 2 multiple-choice benchmarks + calibration) and run it on two model checkpoints from different training stages. Produce a comparison report, and write a recommendation on whether continued training is worthwhile based on the trends.

Further reading: “Measuring Massive Multitask Language Understanding” (Hendrycks et al., 2021, MMLU) and “HellaSwag” (Zellers et al., 2019). “Beyond the Imitation Game” (Srivastava et al., 2022, BIG-Bench). The EleutherAI lm-evaluation-harness for reproducible evaluation. “On Calibration of Modern Neural Networks” (Guo et al., 2017) and the GPT-4 technical report's calibration discussion. “Data Contamination” analyses (e.g. Sainz et al., 2023). The LMSYS Chatbot Arena for human-preference evaluation.

Part IV Complete: Pretraining LLMs

Ch. 17	Data Collection & Curation	Common Crawl, deduplication, quality and safety filtering, data mixing: the pipeline that determines what a model knows.
Ch. 18	Distributed Training	Data, tensor, and pipeline parallelism, ZeRO and FSDP: spreading a run across thousands of GPUs.
Ch. 19	Architecture Variants	RoPE, RMSNorm, SwiGLU, GQA, FlashAttention, sparse attention: the refinements that define modern models.
Ch. 20	Efficient Training	The memory hierarchy, kernel fusion, Triton, recomputation, fp8, torch.compile: raising hardware utilization.
Ch. 21	Evaluation During Pretraining	Perplexity, MMLU and friends, calibration, contamination: turning the run into an instrument you can steer.

You now have a fully pretrained base model: trained on curated trillions of tokens, distributed across thousands of GPUs, built from a modern architecture, optimized for efficiency, and evaluated throughout. But a base model is not yet useful. It predicts the next token brilliantly, yet it does not reliably follow instructions, refuse harmful requests, or behave the way we want — it merely continues text. Part V — Alignment & Post-training — closes that gap: supervised fine-tuning to teach the model to follow instructions (Chapter 22), reinforcement learning from human feedback to align it with human preferences (Chapter 23), the simpler DPO family that reframes alignment as classification (Chapter 24), reasoning and chain-of-thought training (Chapter 25), and the constitutional and safety methods that make a model trustworthy (Chapter 26). The raw capability you have built becomes a helpful, harmless, honest assistant.
