Part V: Alignment & RLHF

Chapter 23

Reinforcement Learning from Human Feedback

Reward modeling from human preferences, a from-scratch tour of policy-gradient RL, the PPO training loop, KL penalties, GRPO, and why RLHF is so hard to get right.

22 Exercises

Learning Objectives

1.	Explain why imitation (SFT) is not enough and why preferences are easier to collect than demonstrations.
2.	Understand the three-stage RLHF pipeline: SFT, reward model, RL.
3.	Collect and reason about human preference data.
4.	Train a reward model using the Bradley-Terry preference loss.
5.	Understand policy-gradient RL from scratch: policy, reward, REINFORCE.
6.	Understand the PPO algorithm: the clipped objective, advantages, and the value model.
7.	Explain the KL penalty and how it prevents reward hacking.
8.	Assemble the full RLHF/PPO loop with its four models.
9.	Understand GRPO and how it removes the value model with group-relative advantages.
10.	Diagnose why RLHF is unstable and prone to reward hacking and over-optimization.

Chapter 22 turned a base model into an instruction-follower by imitation: we showed it good demonstrations and trained it to copy them. This works remarkably well, but it has a fundamental ceiling. This chapter is about breaking through that ceiling with a different kind of training signal — human PREFERENCES rather than demonstrations.

The Two Limits of Imitation

Supervised fine-tuning has two deep limitations that no amount of additional demonstrations can fully fix:

• It can only be as good as its demonstrations. The model learns to imitate the responses it is shown, so it cannot easily become BETTER than the humans (or models) who wrote them. Imitation has a quality ceiling set by the data.

• It never learns what NOT to do. SFT shows the model good responses, but it never shows the model a bad response and says 'this is worse'. The model has no signal distinguishing a great answer from a merely acceptable one: every demonstration is treated as equally correct.

The Key Realization: Comparing Is Easier Than Writing

Here is the insight that launched RLHF. For many qualities we care about — helpfulness, tone, harmlessness, nuance — it is HARD for a human to write the ideal response, but EASY to compare two responses and say which is better. Writing a perfect, balanced answer to a sensitive question is difficult; judging which of two answers is more balanced is quick and reliable.

✧

Preference Pair

Prompt: My friend is going through a hard time. What should I say?

Chosen: It is kind of you to want to support your friend. A good start is simply being present and listening without rushing to fix things. You might say something like 'I'm here for you, and you don't have to go through this alone.' Ask how you can help rather than assuming...

Rejected: Just tell them to cheer up and that everything happens for a reason. They'll get over it.

Almost anyone can instantly see that the chosen response is more helpful and empathetic than the rejected one, but writing the chosen response from scratch is a real skill. RLHF exploits this asymmetry: it collects easy human JUDGMENTS (which is better?) at scale, and uses them to push the model toward responses humans prefer, beyond what demonstrations alone could teach.

✧

Intuition: Preferences Carry a Richer Signal

A demonstration says 'this response is good.' A preference says 'this response is better than that one' — which implicitly contains information about a whole spectrum of quality. By learning from many such comparisons, the model learns a fine-grained sense of what makes responses better or worse, not just a single notion of 'acceptable'.

This is why RLHF can push past the SFT ceiling: it optimizes for being PREFERRED, which can exceed the quality of any single demonstration. The model can discover responses better than anything in its training data, guided by the preference signal.

RLHF, as introduced by InstructGPT (Ouyang et al., 2022) and used to build ChatGPT, Claude, and others, is a three-stage pipeline. Before diving into the math of each stage, it helps to see the whole shape, so you always know where you are.

Pipeline Flow: The three stages of RLHF (preference collection, step 2, supplies the data for the reward-model stage)

1	SFT model	Start from a supervised fine-tuned model (Chapter 22) — the initial 'policy'
2	Collect prefs	Show humans pairs of model responses; they pick the better one in each pair
3	Reward model	Train a model to predict human preferences — it outputs a scalar 'reward' for any response
4	RL optimize	Use RL (PPO or GRPO) to update the policy to produce responses the reward model scores highly

The Cast of Characters

RLHF involves several models with confusingly similar names. Let us name them all up front so the later sections are clear:

Model	Role
Policy	The model we are training — it generates responses. Starts as the SFT model.
Reference model	A FROZEN copy of the SFT model. Used to measure how far the policy drifts (the KL penalty).
Reward model	Predicts a scalar reward (how much humans would prefer this response). Frozen during RL.
Value model	(PPO only) Estimates expected future reward, used to compute advantages. Trained alongside the policy.

✧

Reward Note: Four Models in Memory at Once (for PPO)

A subtle practical point that surprises beginners: standard PPO-based RLHF keeps up to FOUR models in memory simultaneously — the policy (training), the reference (frozen), the reward model (frozen), and the value model (training). This is a big reason RLHF is memory-hungry and engineering-heavy.

GRPO (Section 23.10) eliminates the value model, cutting this to three. Keep this cast in mind as we build up each piece — by the end you will see exactly how the four interact in the training loop.

The fuel for RLHF is preference data. The process is straightforward: take a prompt, have the model generate two (or more) responses, and ask a human to pick the better one. Each judgment produces a preference pair: a 'chosen' response and a 'rejected' response for the same prompt. This is the same structure you saw in Chapter 22's preference-pair visual, now the central training signal.

How Preferences Are Collected

text•Preference data collection (Pseudocode)
# Repeat for many prompts:
1. sample a prompt q (from real usage or a curated set)
2. generate two responses y_A, y_B from the current policy
3. show a human annotator q, y_A, y_B
4. human picks the better one  →  (chosen, rejected) pair

# Result: a dataset of (prompt, chosen, rejected) triples
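The collection loop above might be sketched as follows. `PreferencePair`, `collect_pair`, and the `generate` and `judge` callables are hypothetical names introduced only for illustration; in a real pipeline `generate` would sample from the policy and `judge` would be a human annotator.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class PreferencePair:
    """One (prompt, chosen, rejected) triple."""
    prompt: str
    chosen: str
    rejected: str

def collect_pair(
    prompt: str,
    generate: Callable[[str], str],          # hypothetical: samples one response
    judge: Callable[[str, str, str], int],   # hypothetical: 0 if A wins, 1 if B wins
) -> PreferencePair:
    """Sample two responses and record which one the judge prefers."""
    y_a = generate(prompt)
    y_b = generate(prompt)
    if judge(prompt, y_a, y_b) == 0:
        return PreferencePair(prompt, chosen=y_a, rejected=y_b)
    return PreferencePair(prompt, chosen=y_b, rejected=y_a)

# Toy stand-ins so the sketch runs end to end (not real models or annotators).
_canned = iter(["Listen first, then ask how you can help.", "Tell them to cheer up."])
pair = collect_pair(
    "My friend is going through a hard time. What should I say?",
    generate=lambda q: next(_canned),
    judge=lambda q, a, b: 0 if len(a) > len(b) else 1,  # toy rule: prefer the longer reply
)
print(pair.chosen)
print(pair.rejected)
```

The toy judge prefers the longer reply purely so the example is deterministic. It is also a small reminder of how an arbitrary judging rule ends up baked into the data.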

Pairwise Comparison vs Direct Rating

Why compare two responses instead of just rating each on a scale of 1–10? Because humans are far more consistent at relative judgments than absolute ones. My '7/10' and your '7/10' may mean different things, and my own ratings drift over a long session. 'A is better than B' is much more stable, both across annotators and over time, even though annotators will still sometimes disagree. Pairwise comparison largely sidesteps the calibration problem of absolute scores.

Pairwise comparison (used)	Direct rating (avoided)
'A is better than B'	'A scores 7/10'
Consistent across annotators	Annotators calibrate differently
Stable over a long session	Ratings drift over time
Easy, fast judgments	Requires an absolute scale
Yields preference pairs	Yields noisy scalar labels

The Quality and Cost of Preference Data

Preference data is expensive and its quality is paramount. Annotators must be trained, given clear guidelines (what counts as 'better'?), and monitored for agreement. Disagreement is inevitable — reasonable people prefer different responses — so the data is inherently noisy. The guidelines encode the values the model will learn: if annotators are told to prefer concise answers, the model becomes concise. The preference data IS the specification of desired behaviour.

⚠️

Annotation Guidelines Become Model Behaviour

Whatever the annotation guidelines emphasize, the model learns. If guidelines reward longer responses, the model becomes verbose. If they reward confident-sounding answers, the model becomes overconfident (one cause of the calibration loss from Chapter 21). The annotators' instructions are, in effect, the model's value system — written in prose and transmitted through preferences.

This makes writing good annotation guidelines a subtle, high-stakes task. Many surprising model behaviours trace back to a line in an annotation guideline. Designing the guidelines is as important as designing the algorithm.

✧

Reward Note: RLAIF: Replacing Human Labels with AI

Collecting human preferences at scale is slow and costly. RLAIF (RL from AI Feedback) replaces the human annotator with a strong AI model that judges which response is better, following a written rubric. This scales preference collection dramatically and is central to Constitutional AI (Chapter 26).

RLAIF works because judging is easier than generating — the same asymmetry from Section 23.1 — so a capable model can be a reliable judge even for responses near its own quality. Most modern alignment pipelines use a mix of human and AI feedback.

We cannot run reinforcement learning against a human — humans are far too slow to provide feedback on every one of the millions of responses RL generates. So we train a REWARD MODEL: a model that LEARNS to predict human preferences, then provides instant reward signals during RL. The reward model is the bridge from slow, expensive human judgments to fast, scalable training signal.

What the Reward Model Computes

A reward model takes a prompt and a response and outputs a single number — a scalar reward — representing how much a human would prefer that response. It is usually built by taking a pretrained model (often the SFT model), removing its token-prediction head, and adding a small head that outputs one number instead of a vocabulary distribution.

Arch Stack: Reward model: a Transformer with a scalar head

scalar reward r	one number
reward head	(d → 1) linear
final hidden state	(d,)
Transformer body	from the SFT model
prompt + response tokens	(T,)

The Bradley-Terry Model and the Reward Loss

How do we train a model to output rewards when our data is only PAIRWISE preferences (A is better than B), not absolute scores? The answer is the Bradley-Terry model, a classic statistical model of pairwise comparisons. It says: the probability that response A is preferred over B is the logistic function of the DIFFERENCE in their rewards.

text•Bradley-Terry preference model
P(A preferred over B) = σ( r(A) - r(B) )

where σ(x) = 1/(1+e⁻ˣ) is the logistic sigmoid, and
r(·) is the scalar reward the model assigns.

Bigger reward gap → more confident preference.
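To make the formula concrete, here is a small numeric sketch. The function name `preference_probability` is an illustrative choice, not something defined elsewhere in this chapter.

```python
import math

def preference_probability(r_a: float, r_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

for gap in (0.0, 1.0, 3.0):
    p = preference_probability(gap, 0.0)
    print(f"reward gap {gap:.1f} -> P(A preferred) = {p:.3f}")
# reward gap 0.0 -> P(A preferred) = 0.500
# reward gap 1.0 -> P(A preferred) = 0.731
# reward gap 3.0 -> P(A preferred) = 0.953
```

Only the DIFFERENCE matters: rewards of (5, 4) and (1, 0) give the same probability. Absolute reward values therefore have no meaning on their own, only gaps between responses to the same prompt.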

This gives us a training objective. For each preference pair (chosen y_w, rejected y_l), we want the reward model to assign a HIGHER reward to the chosen response. We maximize the probability of the observed preference, which means minimizing the following loss:

text•Reward model loss
L_RM = -E[ log σ( r(y_w) - r(y_l) ) ]

y_w = chosen (preferred) response
y_l = rejected response

# Pushes r(chosen) up and r(rejected) down, until their gap
# matches the strength of the human preference.
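A quick way to build intuition for this loss is to evaluate it on a few reward gaps. The sketch below uses plain Python floats; `pairwise_loss` is a hypothetical helper name, and the loss is written in the equivalent form log(1 + e^(-gap)) for numerical stability.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected) for a single preference pair."""
    gap = r_chosen - r_rejected
    # Stable softplus(-gap): max(-gap, 0) + log(1 + exp(-|gap|))
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))

for r_w, r_l in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(f"r_chosen={r_w}, r_rejected={r_l} -> loss={pairwise_loss(r_w, r_l):.3f}")
```

When the model already ranks the pair correctly the loss is small; when it is undecided the loss is log 2 (about 0.693); when it ranks the pair the wrong way round the loss grows roughly linearly with the size of the mistake.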

Python•Training a reward model from preference pairs
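import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.body = base_model            # Transformer from the SFT model
        self.head = torch.nn.Linear(d_model, 1)  # outputs ONE number

    def forward(self, input_ids, attention_mask=None):
        # Assumes a Hugging Face-style body that returns .last_hidden_state
        h = self.body(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, T, d)
        if attention_mask is None:
            last = h[:, -1]               # no padding: the last position is the last token
        else:
            # Right-padded batch: pick each sequence's last REAL token,
            # otherwise the reward would be read off a pad token.
            idx = attention_mask.sum(dim=1) - 1                       # (B,)
            last = h[torch.arange(h.size(0), device=h.device), idx]   # (B, d)
        # Reward = scalar from the LAST token's hidden state
        return self.head(last).squeeze(-1)    # (B,)

def reward_loss(rm, chosen_ids, rejected_ids, chosen_mask=None, rejected_mask=None):
    """Bradley-Terry loss: chosen should score higher than rejected."""
    r_chosen   = rm(chosen_ids, chosen_mask)        # (B,)
    r_rejected = rm(rejected_ids, rejected_mask)    # (B,)
    # -log sigma(r_w - r_l): minimized when r_chosen >> r_rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Train this like any classifier. After training, the reward model
# scores ANY (prompt, response) -- the instant feedback RL needs.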

✧

Reward Note: The Reward Model Is an Imperfect Proxy

The reward model is the linchpin of RLHF, and its biggest weakness. It is a LEARNED, IMPERFECT approximation of human preferences, trained on limited data. It will have blind spots, biases, and exploitable quirks. The RL stage will optimize HARD against it, and any flaw becomes a target. This sets up the central difficulty of RLHF, which we return to in Sections 23.7 and 23.9.

A useful mental model: the reward model is a stand-in judge. As long as the policy stays in regions where the judge is reliable, RL works. But push too far and the policy finds responses that fool the judge — high reward, low actual quality. Managing this is the whole game.

Now we have a reward model that scores responses. The RL stage uses that reward to improve the policy. But many readers have never studied reinforcement learning, so this section builds up just enough RL from scratch to understand RLHF. We will keep it concrete and tied to language models throughout.

Reframing Generation as a Sequence of Decisions

In RL terms, generating a response is a sequence of DECISIONS. At each step, the model is in a STATE (the prompt plus the tokens generated so far), takes an ACTION (choosing the next token), and eventually receives a REWARD (the reward model's score of the finished response). The model's strategy for choosing actions is called its POLICY — which is exactly the language model's probability distribution over next tokens.

RL term	In language modeling
Policy π	The language model itself — its distribution over next tokens
State	The prompt plus the tokens generated so far
Action	Choosing the next token
Trajectory	A full generated response (sequence of actions)
Reward	The reward model's score of the completed response
Return	The total reward for the trajectory (here, just the final reward)

The Goal: Maximize Expected Reward

The objective of RL is simple to state: adjust the policy so that the responses it generates get high reward, ON AVERAGE. Formally, we want to maximize the expected reward over the responses the policy produces:

text•The RL objective
maximize  J(θ) = E[ r(y) ]   over responses y sampled from policy πθ

θ = the policy's parameters
r(y) = the reward model's score of response y
# Make high-reward responses more likely, low-reward ones less likely.

The Policy Gradient: REINFORCE

How do we increase expected reward with gradients? Because we are MAXIMIZING J, this is gradient ascent rather than descent. The policy gradient theorem gives a beautifully intuitive answer: to make high-reward responses more likely, increase the probability of the actions that led to them, weighted by how much reward they earned. The simplest version is the REINFORCE algorithm:

text•The REINFORCE policy gradient
∇θ J  =  E[ r(y) · ∇θ log πθ(y) ]

In words: nudge the policy to make response y MORE likely
in proportion to its reward r(y).
    positive reward → push its probability UP
    negative reward → push its probability DOWN
    (a small positive reward still pushes UP, only weakly)
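As a minimal sketch of this update, consider a toy "policy" with just three possible responses and one logit per response. Everything here (the function names, the three fixed rewards, the learning rate) is an illustrative assumption; a real language-model policy has one distribution per token, not per response.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(rewards, steps=500, lr=0.1, seed=0):
    """Plain REINFORCE on a toy policy with one logit per 'response'."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)              # start from a uniform policy
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(rewards)), weights=probs)[0]   # sample y ~ pi
        r = rewards[a]
        for k in range(len(theta)):
            # d/d theta_k of log pi(a) for a softmax policy: 1[k == a] - pi(k)
            grad_logp = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * r * grad_logp    # gradient ASCENT on expected reward
    return softmax(theta)

probs = reinforce(rewards=[1.0, 0.0, -1.0])
print([round(p, 3) for p in probs])
```

Sampling the response with reward +1 raises its probability, sampling the one with reward -1 lowers it, and sampling the one with reward 0 changes nothing. After training, most of the probability mass should sit on the first response.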

✧

Intuition: REINFORCE Is Just 'Do More of What Worked'

Strip away the math and REINFORCE says something obvious: sample some responses, see which got high reward, and adjust the model to make those responses more likely next time. It is trial-and-error learning — generate, evaluate, reinforce the good. The gradient ∇ log π simply points in the direction that increases a response's probability; multiplying by the reward decides how hard to push and in which direction.

Everything more advanced — baselines, advantages, PPO's clipping — is about making this basic idea STABLE and SAMPLE-EFFICIENT. The core, 'do more of what earned reward,' never changes.

The Problem with Plain REINFORCE: Variance

Plain REINFORCE works but is extremely noisy. The reward r(y) varies wildly from sample to sample, so the gradient estimate jumps around, making training slow and unstable. The standard fix is a BASELINE: instead of weighting by the raw reward, weight by how much BETTER than average a response was. This 'advantage' — reward minus a baseline — has much lower variance and is the key idea connecting REINFORCE to PPO.

text•Advantage: reward relative to a baseline
A(y) = r(y) - b      (advantage = reward minus a baseline b)

∇θ J  =  E[ A(y) · ∇θ log πθ(y) ]

# A > 0: better than average → increase probability
# A < 0: worse than average → decrease probability
# Lower variance than using raw reward → more stable training.
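The claim that a baseline lowers variance without changing what is being optimized can be checked exactly on a toy problem. The sketch below enumerates every action of a three-action softmax policy and computes the mean and variance of the single-sample gradient estimator for one logit. The probabilities and rewards are made-up illustrative numbers, and `grad_stats` is a hypothetical helper.

```python
def grad_stats(probs, rewards, baseline, k=0):
    """Exact mean and variance of the single-sample estimator
    (r(a) - b) * d/d theta_k log pi(a), enumerating all actions a."""
    samples = [
        (rewards[a] - baseline) * ((1.0 if a == k else 0.0) - probs[k])
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

probs = [0.5, 0.3, 0.2]
rewards = [10.0, 9.0, 11.0]      # toy numbers: every reward is large and positive
expected_reward = sum(p * r for p, r in zip(probs, rewards))   # 9.9

mean_raw, var_raw = grad_stats(probs, rewards, baseline=0.0)
mean_adv, var_adv = grad_stats(probs, rewards, baseline=expected_reward)
print(f"raw reward: mean={mean_raw:.3f}, variance={var_raw:.3f}")
print(f"advantage:  mean={mean_adv:.3f}, variance={var_adv:.3f}")
```

Both estimators have the same mean (0.05), so the baseline does not bias the gradient. The variance, however, drops from about 24.6 to about 0.12 in this example, because the large shared offset in the rewards no longer multiplies the noise.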

✧

Reward Note: Where the Baseline Comes From

The baseline b should estimate the expected reward, so the advantage measures 'better or worse than expected'. PPO learns this baseline with a separate VALUE MODEL that predicts expected reward (Section 23.6). GRPO instead uses the average reward of a GROUP of samples for the same prompt as the baseline (Section 23.10) — simpler, and no extra model.

This single design choice — how to estimate the baseline — is the main difference between PPO and GRPO. Keep it in mind; it is the thread connecting the next two algorithm sections.

REINFORCE with a baseline is the conceptual foundation, but the algorithm actually used in classic RLHF is PPO (Proximal Policy Optimization; Schulman et al., 2017). PPO adds two crucial ingredients that make policy-gradient RL stable enough to train language models: a learned value model for the baseline, and a CLIPPED objective that prevents the policy from changing too fast.

Ingredient 1: The Value Model

PPO learns the baseline with a VALUE MODEL — a network (often sharing the body of the policy or reward model) that predicts the expected reward from a given state. The advantage is then the actual reward minus the value model's prediction: how much better the outcome was than expected. The value model is trained alongside the policy to predict rewards accurately.

Ingredient 2: The Clipped Objective

The danger in policy-gradient RL is taking too large a step — a big update can collapse the policy into producing garbage, from which it never recovers. PPO prevents this by CLIPPING: it limits how much the policy's probability for an action can change in a single update. If an update would change a token's probability by more than a small factor, the change is clipped.

text•The PPO clipped objective
ratio  ρ = πθ(a|s) / π_old(a|s)        # how much the policy changed

L_PPO = E[ min( ρ · A,  clip(ρ, 1-ε, 1+ε) · A ) ]

ε ≈ 0.2 is the clip range.
# The clip caps the gain: once ρ leaves [1-ε, 1+ε] in the direction
# the advantage favors, the objective stops rewarding further movement,
# keeping each step small and safe.

The min-of-two-terms looks cryptic but is doing something simple: it takes the more pessimistic (smaller) of the unclipped and clipped objectives, which removes the incentive to push the policy ratio far beyond 1±ε. The effect is that PPO improves the policy in small, safe steps — 'proximal' means it stays close to the previous policy each update.
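The four cases of the min-and-clip are easiest to see with scalars. This is a sketch for a single (ratio, advantage) pair; `ppo_clipped_term` is a hypothetical helper name.

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one action."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

for ratio, adv in [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(f"rho={ratio}, A={adv:+.0f} -> {ppo_clipped_term(ratio, adv):+.2f}")
# rho=1.5, A=+1 -> +1.20   (good action already boosted: gain capped)
# rho=0.5, A=+1 -> +0.50   (good action made LESS likely: not clipped)
# rho=0.5, A=-1 -> -0.80   (bad action already suppressed: gain capped)
# rho=1.5, A=-1 -> -1.50   (bad action made MORE likely: not clipped)
```

The clip only switches off the incentive when the policy has already moved far in the direction the advantage favors. When the policy has moved the wrong way, the unclipped term is kept, so the gradient can still correct the mistake.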

text•PPO for RLHF (one iteration) (Pseudocode)
# Given: policy, value model, frozen reward model, frozen reference
1. sample a batch of prompts
2. generate responses from the current policy
3. score responses with the reward model  → rewards
4. compute advantages A using the value model as baseline
5. for several mini-epochs over this batch:
     compute the clipped PPO objective
     add the KL penalty (Section 23.7)
     update the policy AND the value model
# Repeat for many iterations

✧

Reward Note: PPO Is Powerful but Heavy

PPO made RLHF work — it powered InstructGPT and the first ChatGPT. But it is complex: four models in memory, a delicate clipped objective, many hyperparameters (clip range, value-loss weight, KL coefficient, GAE lambda), and a generation step inside every training iteration. Getting PPO stable is notoriously finicky, which motivated the simpler alternatives — DPO (Chapter 24) and GRPO (Section 23.10).

For a beginner, the key takeaways are: PPO improves the policy in small clipped steps, uses a value model for the baseline, and is the classic but heavyweight choice. The next sections add the missing safety mechanism (KL) and then a lighter-weight alternative (GRPO).

There is a serious danger lurking in RLHF, and the KL penalty is the defense against it. Recall from Section 23.4 that the reward model is an imperfect proxy for human preferences. If we optimize against it with no constraint, the policy will discover responses that score HIGH reward but are actually BAD — it exploits the reward model's flaws. This is called REWARD HACKING, and it is the central failure mode of RLHF.

Reward Hacking: Optimizing the Proxy, Not the Goal

Imagine the reward model has a quirk: it slightly over-rewards responses that include the word 'certainly' or that are very long. With unconstrained optimization, the policy will discover this and produce responses stuffed with 'certainly' or padded to absurd length — high reward, terrible quality. The policy is optimizing the PROXY (reward model) instead of the true GOAL (human preference). This is Goodhart's law again (Chapter 21): when the reward becomes the target, it stops measuring quality.

✧

Preference Pair

Prompt: What's the weather like today? (the policy after reward-hacking)

Chosen: I don't have real-time data, but I can explain how to check the weather. (a sensible, honest response)

Rejected: Certainly! Certainly, the weather is certainly a fascinating topic that certainly deserves a certainly thorough and certainly lengthy exploration spanning many certainly-padded paragraphs... (gaming a length/word-count quirk in the reward model)

The Fix: Penalize Drift From the Reference Model

The KL penalty constrains the policy to stay CLOSE to the original SFT model (the frozen reference). It adds a penalty proportional to the KL divergence (Chapter 4) between the current policy and the reference — a measure of how much the policy's distribution has drifted. The policy is rewarded for high reward-model score, but PENALIZED for straying far from the sensible SFT model.

text•The KL-penalized RLHF objective
maximize  E[ r(y) ]  -  β · KL( πθ ∥ π_ref )

r(y) = reward-model score (pulls toward high reward)
β · KL = penalty for drifting from the SFT reference (pulls toward sanity)
β = the KL coefficient, tuned to balance the two forces.
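To get a feel for the size of the KL term, the sketch below computes it exactly for a single next-token distribution over a three-token vocabulary. The distributions are made-up toy values and `kl_divergence` is a hypothetical helper; in real RLHF the KL is estimated from sampled tokens rather than enumerated.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * (log p(x) - log q(x)) for discrete distributions."""
    return sum(px * (math.log(px) - math.log(qx)) for px, qx in zip(p, q) if px > 0)

pi_ref   = [0.50, 0.30, 0.20]   # toy distribution of the frozen reference
pi_close = [0.45, 0.35, 0.20]   # policy that has moved a little
pi_far   = [0.05, 0.05, 0.90]   # policy that has collapsed onto one token

beta = 0.1
for name, pi in [("same", pi_ref), ("close", pi_close), ("far", pi_far)]:
    kl = kl_divergence(pi, pi_ref)
    print(f"{name:5s} KL={kl:.4f}  penalty beta*KL={beta * kl:.4f}")
```

A policy identical to the reference pays no penalty, a small shift costs almost nothing, and a collapse onto one token is penalized heavily. That graded cost is what lets the policy improve while staying anchored.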

In practice the KL term is folded into the per-token reward: each token gets the reward-model score (at the end) minus a per-token penalty for the policy assigning that token a much higher probability than the reference does. This keeps the policy anchored to fluent, sensible language while still letting it improve toward higher reward.

Python•The KL-penalized reward in code
import torch

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine the reward-model score with a per-token KL penalty."""
    # Per-token log-ratio, log pi - log pi_ref, evaluated on the sampled tokens:
    # a sample-based estimate of the KL. Large gap = large drift = large penalty.
    # detach(): the shaped reward is a training target, not something to backprop through.
    per_token_kl = (policy_logprobs - ref_logprobs).detach()   # (B, T)

    # The reward-model score is given only at the final token;
    # the KL penalty applies at every token.
    # (Assumes unpadded responses, so index -1 is the true final token.)
    shaped = -beta * per_token_kl              # penalty everywhere
    shaped[:, -1] += reward                   # + RM score at the end
    return shaped

# beta controls the leash:
#   beta too LOW  -> policy drifts, reward-hacks, produces garbage
#   beta too HIGH -> policy barely changes from SFT, no improvement
# Tuning beta is one of the trickiest parts of RLHF.

⚠️

The KL Penalty Is the Whole Balancing Act

RLHF is a tug-of-war. The reward pulls the policy toward higher-scoring responses; the KL penalty pulls it back toward the sensible reference. Too little KL and the policy reward-hacks into nonsense; too much and it never improves. The KL coefficient β sets the tension, and finding the right value is one of the most delicate, dataset-specific parts of getting RLHF to work.

A useful diagnostic: monitor the KL divergence during training. If it climbs rapidly, the policy is drifting (raise β or stop); if it stays near zero, the policy is barely learning (lower β). Watching the KL is to RLHF what watching the gradient norm is to pretraining.
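The diagnostic above can be turned into a simple feedback rule, in the spirit of the adaptive KL controllers found in some RLHF implementations. The sketch below is only one possible form: `update_beta`, the target KL, and the constants are all illustrative assumptions, not values taken from this chapter.

```python
def update_beta(beta, observed_kl, target_kl, step_size=0.1, max_rel_error=0.2):
    """Proportional controller: raise beta when the measured KL overshoots
    the target, lower it when the KL undershoots."""
    rel_error = (observed_kl - target_kl) / target_kl
    # Limit how much a single noisy measurement can move beta
    rel_error = max(-max_rel_error, min(max_rel_error, rel_error))
    return beta * (1.0 + step_size * rel_error)

beta = 0.1
for observed_kl in (12.0, 6.0, 1.0):          # hypothetical measurements
    new_beta = update_beta(beta, observed_kl, target_kl=6.0)
    print(f"KL={observed_kl:4.1f} -> beta {beta:.4f} -> {new_beta:.4f}")
```

The clamp on the relative error is a deliberate design choice: batch-level KL estimates are noisy, and a controller that overreacts can itself destabilize training.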

We can now assemble the complete RLHF loop. This is where the four models from Section 23.2 — policy, reference, reward, value — all come together. Seeing them interact in one place makes the whole pipeline concrete.

Arch Stack: The four models of PPO-based RLHF

Policy (training)	generates responses, gets updated
Value model (training)	predicts baseline for advantages
Reward model (frozen)	scores responses
Reference model (frozen)	anchors the KL penalty

Python•Code Lab: the RLHF/PPO training loop (simplified)
import torch
import torch.nn.functional as F

# Four models: policy (train), reference & reward (frozen), value (train)
policy   = load_sft_model()           # the model we improve
ref      = load_sft_model().eval()     # frozen copy for KL
reward_m = load_reward_model().eval()  # frozen, from Section 23.4
value_m  = load_value_model()          # trained for the baseline
opt = torch.optim.AdamW(list(policy.parameters()) + list(value_m.parameters()), lr=1e-6)

for iteration in range(n_iterations):
    prompts = sample_prompts(batch_size)

    # 1. GENERATE responses from the current policy
    responses = policy.generate(prompts)

    # 2-3. SCORE and compute ADVANTAGES. Everything in this block is a fixed
    #      target for the update below, so no gradients are tracked. (Tracking
    #      them would make the second mini-epoch backprop through a freed graph.)
    with torch.no_grad():
        rewards    = reward_m(prompts, responses)
        ref_logp   = ref.log_probs(prompts, responses)
        old_logp   = policy.log_probs(prompts, responses)   # log π_old
        shaped_rew = kl_penalized_reward(rewards, old_logp, ref_logp)
        values     = value_m(prompts, responses)             # the baseline
        advantages = compute_gae(shaped_rew, values)   # generalized advantage estimation
        returns    = advantages + values               # regression target for the value model

    # 4. PPO UPDATE: several mini-epochs over this batch
    for _ in range(ppo_epochs):
        new_logp = policy.log_probs(prompts, responses)
        ratio    = torch.exp(new_logp - old_logp)            # π/π_old
        clipped  = torch.clamp(ratio, 0.8, 1.2)           # 1±ε
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        # The value model regresses onto the RETURN, not the per-token reward
        value_loss  = F.mse_loss(value_m(prompts, responses), returns)
        (policy_loss + 0.5 * value_loss).backward()
        opt.step(); opt.zero_grad()

# Note the TINY learning rate (1e-6): RL updates are delicate.
# Generation happens INSIDE the loop -- a big reason RLHF is slow.
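The loop calls `compute_gae` without defining it. A minimal sketch of generalized advantage estimation for ONE response is shown below, using plain Python lists instead of the batched tensors the loop would need; the default `gamma` and `lam` values are illustrative assumptions, not settings from this chapter.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """rewards, values: per-token lists for one response, same length.
    Returns per-token advantages (generalized advantage estimation)."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0        # nothing follows the final token
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better this step was than predicted
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Reward arrives only at the final token; the value model predicted 0.5 throughout.
rewards = [0.0, 0.0, 1.0]
values  = [0.5, 0.5, 0.5]
print(compute_gae(rewards, values, lam=1.0))   # [0.5, 0.5, 0.5]
print(compute_gae(rewards, values, lam=0.0))   # [0.0, 0.0, 0.5]
```

With `lam=1.0` every token is credited with the full "return minus prediction"; with `lam=0.0` only the one-step surprise counts, so credit stays at the final token. Values in between trade variance against bias.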

✧

Reward Note: Generation-in-the-Loop Makes RLHF Slow

Unlike SFT, where the data is fixed, RLHF GENERATES fresh responses from the current policy on every iteration — then scores them, then updates. This generation step (an autoregressive decode for every prompt, every iteration) is expensive and dominates the cost. RLHF is far slower per step than SFT, and the generation/training interleaving is a major engineering challenge.

This is one more reason the field sought simpler alternatives. DPO (Chapter 24) avoids generation entirely by working directly on the fixed preference pairs. GRPO (next section) still generates, but removes the value model. Each trades away some of PPO's complexity.

RLHF is powerful but notoriously difficult. It combines the instability of reinforcement learning with the imperfection of a learned reward model and the complexity of juggling four models. Understanding the failure modes is essential — they explain why so much of the craft of alignment is about taming RLHF.

Failure mode	What happens	Defense
Reward hacking	Policy exploits reward-model flaws	KL penalty, better reward model
Over-optimization	Reward rises but true quality falls	Early stopping, monitor real quality
Mode collapse	Policy converges to repetitive outputs	KL penalty, entropy bonus
Reward-model drift	Policy moves outside RM's reliable range	Retrain RM on fresh policy outputs
Instability / divergence	Training collapses suddenly	Tiny lr, clipping, careful tuning
Sycophancy	Model tells humans what they want to hear	Careful annotation guidelines
Calibration loss	Model becomes overconfident	(Ch. 21) hard to fully fix

Over-Optimization: The Goodhart Curve

A characteristic RLHF phenomenon: as you optimize against the reward model, the reward-model SCORE keeps rising, but the TRUE quality (measured by held-out humans) rises, peaks, and then FALLS. Beyond the peak, the policy is exploiting the reward model rather than improving — over-optimization. The art is to stop near the peak, before the proxy and the goal diverge. This 'Goodhart curve' is one of the most important empirical findings in RLHF (Gao et al., 2022).
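In practice, "stop near the peak" usually means selecting a checkpoint by held-out quality instead of by reward-model score. The sketch below shows the idea; `pick_checkpoint` is a hypothetical helper and the numbers are invented to have the shape of the curve described above, not measured results.

```python
def pick_checkpoint(history):
    """history: list of (step, reward_model_score, heldout_quality).
    Returns the step with the best HELD-OUT quality, ignoring the proxy score."""
    best_step, _, _ = max(history, key=lambda row: row[2])
    return best_step

# Made-up numbers: the proxy keeps rising while true quality peaks and falls.
history = [
    (0,    0.10, 0.50),
    (500,  0.45, 0.62),
    (1000, 0.70, 0.68),
    (1500, 0.88, 0.64),
    (2000, 0.97, 0.55),
]
print("best by held-out quality:", pick_checkpoint(history))                 # 1000
print("best by reward-model score:", max(history, key=lambda r: r[1])[0])    # 2000
```

The two selection rules disagree, and that disagreement is the over-optimization gap: the checkpoint with the highest proxy score is the worst of the later ones by the measure that matters.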

⚠️

Reward-Model Score Is Not the Goal

The single most important mental discipline in RLHF: the reward-model score is a PROXY, not the objective. A rising reward-model score does NOT guarantee a better model; past the over-optimization point, it usually signals a WORSE one. You must evaluate true quality (held-out human judgment or a trusted eval) and be willing to stop even while the reward-model score is still climbing.

Teams that chase the reward-model number end up with reward-hacked models that score beautifully and behave terribly. The number is a guide, not the goal. This discipline — distrust the proxy — is the hardest and most important lesson of RLHF.

▶

ML Connection: Why DPO and GRPO Emerged

RLHF's difficulty (four models, reward hacking, instability, slow generation-in-the-loop, dozens of hyperparameters) created strong pressure for simpler methods. DPO (Chapter 24) eliminates the separate reward model AND the RL loop entirely, reframing alignment as a simple classification problem on preference pairs. GRPO (next section) keeps RL but removes the value model and simplifies the advantage.

Both are responses to the same pain: classic PPO-based RLHF works, but it is hard. The field has been steadily searching for methods that capture RLHF's benefits with less of its fragility.

GRPO (Group Relative Policy Optimization), introduced by DeepSeek (Shao et al., 2024) and central to the DeepSeek-R1 reasoning models, is a streamlined alternative to PPO that has become very popular — especially for training reasoning models. Its key idea is elegant: eliminate the value model entirely by computing the baseline from a GROUP of sampled responses to the same prompt.

The Problem GRPO Solves

Recall that PPO needs a value model to estimate the baseline for advantages (Section 23.6). This value model is typically about as large as the policy, so it adds substantial memory, and it adds training complexity: it must be learned accurately, or the advantages are wrong. GRPO asks: what if we could get a good baseline WITHOUT a separate model? Its answer: sample several responses per prompt, and use their AVERAGE reward as the baseline.

The Group-Relative Advantage

For each prompt, GRPO samples a GROUP of G responses (say, 8 or 16). It scores all of them with the reward model, then computes each response's advantage as its reward relative to the group — normalized by subtracting the group mean and dividing by the group standard deviation. A response that beats its group-mates gets positive advantage; one that loses gets negative. No value model needed.

text•GRPO group-relative advantage
For prompt q, sample a group of G responses {o₁, ..., o_G}.
Score each with the reward model: {r₁, ..., r_G}.

Aᵢ = (rᵢ - mean(r₁..r_G)) / std(r₁..r_G)

# The group's mean reward IS the baseline. Above average → A>0,
# below average → A<0. Every token in oᵢ shares the same advantage Aᵢ.

✧

Intuition: Why the Group Average Is a Great Baseline

Remember from Section 23.5 that a good baseline estimates 'expected reward', so the advantage measures 'better or worse than expected'. For a given prompt, the best estimate of the expected reward is simply the average reward of several responses to THAT prompt. The group itself provides the baseline — no separate value model required.

This is especially natural for reasoning tasks: sample 16 attempts at a math problem, and a solution is 'good' relative to how the other 15 attempts did. The group-relative advantage automatically adapts to each prompt's difficulty. Hard prompts where all responses score low still produce a useful signal about which response was best, as long as the rewards within the group are not all identical (in which case every advantage is zero and the prompt contributes no gradient).

The GRPO Objective

GRPO then uses the same clipped PPO-style objective from Section 23.6, but with the group-relative advantage, and with the KL penalty to the reference model added directly to the objective rather than folded into the per-token reward. Because every token in a response shares one advantage, there is no per-token value estimation, which makes the algorithm simpler than PPO.

text•The GRPO objective
L_GRPO = E[ min( ρ · Aᵢ,  clip(ρ, 1-ε, 1+ε) · Aᵢ ) ]  -  β · KL(πθ ∥ π_ref)

ρ = πθ / π_old  (the policy ratio, as in PPO)
Aᵢ = the group-relative advantage of response i
# Same clipped objective as PPO, but NO value model, and the
# advantage comes from the group, not a learned baseline.

text•GRPO (one iteration) (Pseudocode)
# Three models: policy (train), reward (frozen), reference (frozen)
# NO value model!
1. sample a batch of prompts
2. for each prompt, generate a GROUP of G responses
3. score every response with the reward model
4. advantage A_i = (r_i - group_mean) / group_std
5. clipped policy update using A_i, plus KL penalty to reference
# Simpler: one fewer model, no value-function learning
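
The five steps above can be sketched end to end on a toy problem. This is a deliberately simplified illustration, not a real GRPO trainer: the "policy" is a softmax over four candidate responses to a single prompt, the "reward model" is a fixed lookup table (`REWARDS`, made up for this example), and each batch is used for exactly one gradient step, so the policy ratio is 1, clipping never activates, and the KL penalty is omitted. All names and hyperparameters are assumptions of the sketch.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)

REWARDS = [0.1, 0.2, 0.8, 0.9]   # hypothetical fixed reward per candidate response
G = 8                            # group size
LR = 0.5                         # toy learning rate
logits = [0.0] * len(REWARDS)    # toy policy parameters (one logit per response)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(params):
    return sum(p * r for p, r in zip(softmax(params), REWARDS))

initial = expected_reward(logits)

for step in range(200):
    probs = softmax(logits)
    # Step 2: sample a GROUP of G responses for the (single) prompt
    group = random.choices(range(len(REWARDS)), weights=probs, k=G)
    # Step 3: score every response
    rewards = [REWARDS[a] for a in group]
    # Step 4: group-relative advantages (all zero if every reward ties)
    mu, sigma = mean(rewards), stdev(rewards)
    advs = [(r - mu) / (sigma + 1e-6) for r in rewards]
    # Step 5: gradient ascent on (1/G) * sum_i A_i * log pi(a_i).
    # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - probs[k].
    for a, adv in zip(group, advs):
        for k in range(len(logits)):
            grad = (1.0 if k == a else 0.0) - probs[k]
            logits[k] += LR * adv * grad / G

final = expected_reward(logits)
print(f"expected reward: {initial:.3f} -> {final:.3f}")
```

Probability mass moves toward the higher-reward responses without any value model: the group mean alone serves as the baseline. A real implementation would replace the lookup table with a reward model, the logits with a language model, and add the clipping and KL terms.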

Python•GRPO advantage computation from scratch
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size) reward-model scores.
       Returns group-normalized advantages of the same shape."""
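    # Baseline = the per-prompt group MEAN (no value model needed)
    mean = rewards.mean(dim=1, keepdim=True)   # (num_prompts, 1)
    # torch's std defaults to the unbiased (n-1) estimator, so group_size must
    # be >= 2; the small epsilon avoids division by zero when all rewards tie.
    std  = rewards.std(dim=1, keepdim=True) + 1e-6
    # Group-relative advantage: above the group mean -> positive
    return (rewards - mean) / std            # (num_prompts, group_size)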

# Example: one prompt, a group of 4 responses
rewards = torch.tensor([[0.8, 0.2, 0.9, 0.1]])
adv = grpo_advantages(rewards)
print(adv)  # the 0.9 and 0.8 responses get positive advantage,
            # the 0.2 and 0.1 responses get negative advantage.

# Every TOKEN in a response is assigned that response's advantage,
# then used in the clipped objective. No per-token value model.
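
As a companion to the advantage computation, the following sketch evaluates the clipped GRPO objective for given log-probabilities. It is written in plain Python on scalars so the arithmetic is easy to follow; a real implementation would operate on tensors and differentiate through `logp_new`. The function names, the data layout, and the default values of `eps` and `beta` are illustrative assumptions. The per-token KL term uses the estimator x - log(x) - 1 with x = π_ref / πθ, which is the form used in the DeepSeekMath paper; the averaging (over tokens within a response, then over responses) also follows that paper.

```python
import math

def grpo_token_objective(logp_new, logp_old, logp_ref, advantage,
                         eps=0.2, beta=0.04):
    """Objective contribution of ONE sampled token (to be maximized).

    logp_new: log-prob of the token under the current policy pi_theta
    logp_old: log-prob under the policy that generated the sample
    logp_ref: log-prob under the frozen reference model
    advantage: the group-relative advantage of the token's response
    """
    ratio = math.exp(logp_new - logp_old)              # rho = pi_theta / pi_old
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))    # clip(rho, 1-eps, 1+eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate: x - log(x) - 1, with x = pi_ref / pi_theta
    log_x = logp_ref - logp_new
    kl = math.exp(log_x) - log_x - 1.0
    return surrogate - beta * kl

def grpo_objective(group, eps=0.2, beta=0.04):
    """group: list of (advantage, tokens) pairs, one per response, where
    tokens is a list of (logp_new, logp_old, logp_ref) triples.
    Averages over the tokens of each response, then over responses."""
    per_response = []
    for advantage, tokens in group:
        vals = [grpo_token_objective(n, o, r, advantage, eps, beta)
                for (n, o, r) in tokens]
        per_response.append(sum(vals) / len(vals))
    return sum(per_response) / len(per_response)

# Two responses; the policy has not moved yet, so ratio = 1 and KL = 0.
group = [
    (1.0,  [(-1.0, -1.0, -1.0), (-0.5, -0.5, -0.5)]),
    (-1.0, [(-2.0, -2.0, -2.0)]),
]
print(grpo_objective(group))

# Ratio of 2 with a positive advantage: the gain is capped at (1 + eps) * A.
print(grpo_token_objective(-1.0 + math.log(2.0), -1.0, -1.0, 1.0, beta=0.0))
```

Note the asymmetry that the `min` creates: with a positive advantage the gain is capped once the ratio exceeds 1 + ε, but with a negative advantage and a ratio above 1 + ε the unclipped (more pessimistic) term is kept, so the policy is still penalized in full.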

PPO	GRPO
Four models (incl. value)	Three models (no value)
Learned value-model baseline	Group-mean baseline
Per-token advantages (GAE)	One advantage per response
More memory, more tuning	Less memory, simpler
Classic RLHF (InstructGPT)	DeepSeek-R1, reasoning models
Generates 1 response/prompt	Generates a GROUP/prompt

✧

Reward Note: GRPO and the Reasoning-Model Boom

GRPO became prominent because it is well-suited to training REASONING models (Chapter 25). For a math or coding problem, you can sample many solution attempts and score them automatically (does the answer match? does the code pass tests?) — a perfect setting for group-relative advantages. DeepSeek used GRPO with such VERIFIABLE rewards to train R1's reasoning, often skipping the learned reward model entirely in favour of programmatic correctness checks.

This connects RLHF to the reasoning revolution: when rewards can be computed automatically (verifiable tasks), GRPO turns RL into a practical, scalable way to teach step-by-step problem solving. We return to this in Chapter 25.
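
A verifiable reward can be as simple as a string check. The sketch below is one hypothetical way to write such a check for math-style answers; the `Answer:` marker convention, the function names, and the normalization are assumptions of this example, not a description of how DeepSeek implemented its checks.

```python
import re

def extract_final_answer(response):
    """Return the text after the LAST 'Answer:' marker, or None if absent.
    The 'Answer:' convention is an assumption of this sketch."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return None
    return matches[-1].strip().rstrip(".")

def verifiable_reward(response, gold):
    """1.0 if the extracted final answer equals the reference answer, else 0.0."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0          # no parsable answer counts as a failure
    return 1.0 if answer == gold.strip() else 0.0

# A group of 4 sampled attempts at "What is 6 * 7?" (made-up responses).
attempts = [
    "6 * 7 = 42.\nAnswer: 42",
    "6 * 7 = 41.\nAnswer: 41",
    "Six sevens are forty-two. Answer: 42.",
    "I am not sure.",
]
rewards = [verifiable_reward(a, "42") for a in attempts]
print(rewards)
```

These 0/1 scores can be passed straight into the group-relative advantage computation in place of reward-model scores. In practice the check needs more careful normalization (equivalent fractions, units, formatting), since any gap between the checker and true correctness is something the policy can learn to exploit.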

Beyond the core algorithms, getting RLHF to work in practice involves a set of variants and practical lessons. This section collects the most important ones.

The Family of Methods

Method	Key idea	Trade-off
PPO-RLHF	Reward model + clipped policy RL	Powerful, complex, unstable
GRPO	Group-relative advantage, no value model	Simpler, great for reasoning
RLAIF	AI feedback replaces human labels	Scalable, needs a good judge
DPO	Skip RL; classify preferences directly	Simple, stable (Chapter 24)
Rejection sampling	Keep best-of-N by reward, then SFT	Very simple, a strong baseline
RLVR	RL with verifiable rewards	For math/code (Chapter 25)

Rejection Sampling: The Simple Baseline

Before reaching for full PPO, it is worth knowing the simplest preference-based method: rejection sampling (or best-of-N). Generate N responses per prompt, score them with the reward model, keep the best one, and add it to an SFT dataset. Then fine-tune on these best responses. This 'distills' the reward model's preferences into the policy via plain SFT — no RL machinery at all. It is surprisingly effective and is often used as a first step or a strong baseline.
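
The procedure just described fits in a few lines. In this sketch `generate` and `reward` are placeholders passed in by the caller; the toy versions below (`toy_generate`, `toy_reward`) are made up so the example runs on its own, standing in for a sampling call to the policy and a reward-model forward pass.

```python
from itertools import cycle

def best_of_n(prompt, generate, reward, n):
    """Sample n responses and return (best_response, best_score)."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score

def build_sft_dataset(prompts, generate, reward, n=4):
    """Keep only the highest-reward response per prompt, as SFT examples."""
    dataset = []
    for prompt in prompts:
        response, _ = best_of_n(prompt, generate, reward, n)
        dataset.append({"prompt": prompt, "response": response})
    return dataset

# --- Toy stand-ins (hypothetical) so the sketch is runnable ---
_endings = cycle(["Bye.", "Done.", "Hope that helps, thank you!", "No."])

def toy_generate(prompt):
    # Deterministically cycles through four canned endings.
    return f"Reply to '{prompt}'. {next(_endings)}"

def toy_reward(prompt, response):
    # Rewards polite responses; a real system would call the reward model.
    return 1.0 if "thank you" in response.lower() else 0.0

sft_data = build_sft_dataset(["What is RLHF?", "Define KL."],
                             toy_generate, toy_reward, n=4)
for row in sft_data:
    print(row)
```

The resulting list of prompt/response pairs is then used for ordinary supervised fine-tuning, exactly as in Chapter 22. Larger N raises the reward of the kept responses but also the sampling cost, and it leans harder on the reward model being right.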

✧

Reward Note: Practical Wisdom

Hard-won lessons from practitioners: (1) Use a tiny learning rate — RL updates are far more delicate than SFT. (2) Watch the KL divergence as your primary health signal. (3) Do not trust the reward-model score; evaluate true quality. (4) Start with the simplest method that works — rejection sampling or DPO — before reaching for PPO. (5) Spend effort on the reward model and preference data; they matter more than the RL algorithm.

The recurring theme: the algorithm is the easy part. The DATA (preferences), the REWARD MODEL (the proxy), and the DISCIPLINE (distrusting the proxy) are what make or break RLHF.
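
Lesson (2) is cheap to act on, because the KL from the reference can be estimated from quantities the training loop already has. A minimal sketch, assuming you have the log-probabilities that the policy and the reference model assign to tokens SAMPLED FROM THE POLICY (the function name and the example numbers are made up):

```python
def estimate_kl(policy_logps, ref_logps):
    """Monte Carlo estimate of KL(policy || reference), per token.

    Both lists hold log-probabilities of the SAME tokens, which must have
    been sampled from the policy. The estimate is the mean of
    log pi_theta(token) - log pi_ref(token).
    """
    if len(policy_logps) != len(ref_logps) or not policy_logps:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)

# Before any RL update the policy equals the reference, so the estimate is 0.
print(estimate_kl([-1.0, -2.0, -0.5], [-1.0, -2.0, -0.5]))

# After drifting, the policy assigns its own samples higher probability
# than the reference does, and the estimate becomes positive.
print(estimate_kl([-0.5, -0.5], [-1.0, -1.5]))
```

Logged every iteration, this number shows drift long before sample quality visibly degrades. On small batches the estimate is noisy and can even dip below zero, so it is worth averaging over many tokens before reacting to it.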

RLHF's Place in the Modern Recipe

Today, RLHF (or a preference-optimization stand-in like DPO) is a standard stage in building frontier assistants. The typical modern recipe is: pretrain (Part IV), SFT (Chapter 22), then preference optimization (increasingly DPO or GRPO rather than classic PPO), then safety tuning (Chapter 26). RLHF was the breakthrough that made assistant models markedly more helpful and better aligned with what users want; its descendants continue to do that work with less fragility.

RLHF Quick-Reference

Concept	Key idea	Remember
Why RLHF	Comparing is easier than writing	Breaks the SFT imitation ceiling
Preference data	(prompt, chosen, rejected) triples	Guidelines = model values
Reward model	Bradley-Terry: σ(r_w - r_l)	An imperfect proxy
Policy gradient	Do more of what earned reward	REINFORCE + baseline
PPO	Clipped objective + value model	Powerful but heavy (4 models)
KL penalty	Stay close to the SFT reference	Prevents reward hacking
Over-optimization	Reward up, true quality down	Distrust the proxy; stop early
GRPO	Group-mean baseline, no value model	Simpler; great for reasoning

Exercises

Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.

✎

Exercise 1: Pen & Paper

Explain the two limits of SFT and why preferences carry a richer signal than demonstrations. Give an example task where comparing is much easier than writing.

✎

Exercise 2: Pen & Paper

Name the four models in PPO-based RLHF and state each one's role. Which are frozen and which are trained?

✎

Exercise 3: Pen & Paper

Why is pairwise comparison preferred over absolute rating for preference collection? Give two reasons grounded in annotator behaviour.

✎

Exercise 4: Derive

Starting from the Bradley-Terry model P(A>B)=σ(r_A-r_B), derive the reward-model loss -log σ(r_w - r_l). Explain what minimizing it does to the rewards.

✎

Exercise 5: Pen & Paper

Map the RL vocabulary (policy, state, action, reward, trajectory) onto language generation. What is the 'policy' concretely?

✎

Exercise 6: Derive

Write the REINFORCE gradient and explain in words what each factor does. Why does multiplying by reward implement 'do more of what worked'?

✎

Exercise 7: Pen & Paper

Why does plain REINFORCE have high variance, and how does a baseline (advantage) reduce it? What does A>0 vs A<0 mean?

✎

Exercise 8: Pen & Paper

Explain the PPO clipped objective. What does the clip prevent, and why is taking the min of the two terms the right thing to do?

✎

Exercise 9: Pen & Paper

Explain reward hacking with a concrete example. Why does the KL penalty defend against it, and what goes wrong if β is too low or too high?

✎

Exercise 10: Pen & Paper

Describe the over-optimization (Goodhart) curve. Why must you sometimes stop training while the reward-model score is still rising?

✎

Exercise 11: Pen & Paper

Compare PPO and GRPO. What does GRPO remove, where does its baseline come from, and why is it well-suited to reasoning tasks?

✎

Exercise 12: Code

Implement a reward model: a Transformer body with a scalar head. Implement the Bradley-Terry loss and train it on a small set of preference pairs.

✎

Exercise 13: Code

Verify your reward model: after training, check that it assigns higher rewards to held-out chosen responses than rejected ones. Report pairwise accuracy.

✎

Exercise 14: Code

Implement REINFORCE on a toy text task (e.g. generate sequences that maximize a simple programmatic reward). Show the average reward rising over training.

✎

Exercise 15: Code

Add a baseline to your REINFORCE implementation (running mean of rewards). Plot the gradient variance with and without the baseline.

✎

Exercise 16: Code

Implement the PPO clipped objective. On synthetic advantages and ratios, verify the clipping behaviour: show that the objective stops improving once the ratio moves outside [1-ε, 1+ε] in the direction the advantage favours.

✎

Exercise 17: Code

Implement the KL-penalized per-token reward. Sweep β and show how the policy's drift (measured by KL from the reference) responds.

✎

Exercise 18: Code Lab

Build a minimal PPO-RLHF loop on a small model and a simple reward (e.g. reward responses for ending politely). Track reward, KL, and sample outputs over training.

✎

Exercise 19: Code

Implement grpo_advantages: group-normalize rewards per prompt. Verify that above-average responses get positive advantage and below-average get negative.

✎

Exercise 20: Code Lab

Implement a minimal GRPO loop: for each prompt, sample a group, score, compute group-relative advantages, and do a clipped update. Compare to your PPO loop in memory and stability.

✎

Exercise 21: Code

Implement rejection sampling (best-of-N): generate N responses, keep the highest-reward one, and SFT on the kept responses. Show it improves average reward without any RL.

✎

Exercise 22: Code (Challenge)

Build a full mini-RLHF pipeline on a small model: (1) train a reward model on preference pairs, (2) optimize the policy with BOTH PPO and GRPO against it with a KL penalty, (3) deliberately set β too low and demonstrate reward hacking, then (4) compare PPO vs GRPO on stability, memory, and final true quality (judged by a held-out check). Write up which was easier to get working and why.

Further reading: “Training language models to follow instructions with human feedback” (Ouyang et al., 2022, InstructGPT) — the canonical RLHF paper. “Deep Reinforcement Learning from Human Preferences” (Christiano et al., 2017) for the original idea. “Proximal Policy Optimization Algorithms” (Schulman et al., 2017) for PPO. “DeepSeekMath” (Shao et al., 2024) and the DeepSeek-R1 report for GRPO. “Scaling Laws for Reward Model Overoptimization” (Gao et al., 2022) for the Goodhart curve. “Constitutional AI” (Bai et al., 2022) for RLAIF. The Hugging Face TRL library for PPO, GRPO, and DPO implementations.

Next → Chapter 24: Direct Preference Optimization

RLHF works, but you have now seen how hard it is: a reward model, a value model, a delicate RL loop, reward hacking, and a tug-of-war over the KL coefficient. Chapter 24 introduces a remarkable simplification — Direct Preference Optimization (DPO) — which achieves the goal of preference alignment WITHOUT a separate reward model and WITHOUT reinforcement learning. DPO shows, through an elegant derivation, that the RLHF objective can be reframed as a simple classification loss on preference pairs, trainable with ordinary supervised learning. We will derive it from the RLHF objective of this chapter, meet its variants (IPO, KTO, ORPO), and understand when to choose DPO over the RL methods you just learned.
