Reinforcement Learning from Human Feedback
Chapter 22 turned a base model into an instruction-follower by imitation: we showed it good demonstrations and trained it to copy them. This works remarkably well, but it has a fundamental ceiling. This chapter is about breaking through that ceiling with a different kind of training signal — human PREFERENCES rather than demonstrations.
The Two Limits of Imitation
Supervised fine-tuning has two deep limitations that no amount of additional demonstrations can fully fix:
The Key Realization: Comparing Is Easier Than Writing
Here is the insight that launched RLHF. For many qualities we care about — helpfulness, tone, harmlessness, nuance — it is HARD for a human to write the ideal response, but EASY to compare two responses and say which is better. Writing a perfect, balanced answer to a sensitive question is difficult; judging which of two answers is more balanced is quick and reliable.
Almost anyone can instantly see that the left response is more helpful and empathetic than the right one — but writing the left response from scratch is a real skill. RLHF exploits this asymmetry: it collects easy human JUDGMENTS (which is better?) at scale, and uses them to push the model toward responses humans prefer, beyond what demonstrations alone could teach.
RLHF, as popularized by InstructGPT (Ouyang et al., 2022) and used to build ChatGPT, Claude, and others, is a three-stage pipeline: supervised fine-tuning, reward-model training on collected human preferences, and RL optimization. Before diving into the math of each stage, it helps to see the whole shape, so you always know where you are.
Pipeline Flow: The stages of RLHF, with preference collection shown as its own step
| Step | Stage | What happens |
|---|---|---|
| 1 | SFT model | Start from a supervised fine-tuned model (Chapter 22), the initial 'policy' |
| 2 | Collect prefs | Show humans pairs of model responses; they pick the better one in each pair |
| 3 | Reward model | Train a model to predict human preferences; it outputs a scalar 'reward' for any response |
| 4 | RL optimize | Use RL (PPO or GRPO) to update the policy to produce responses the reward model scores highly |
The Cast of Characters
RLHF involves several models with confusingly similar names. Let us name them all up front so the later sections are clear:
| Model | Role |
|---|---|
| Policy | The model we are training — it generates responses. Starts as the SFT model. |
| Reference model | A FROZEN copy of the SFT model. Used to measure how far the policy drifts (the KL penalty). |
| Reward model | Predicts a scalar reward (how much humans would prefer this response). Frozen during RL. |
| Value model | (PPO only) Estimates expected future reward, used to compute advantages. Trained alongside the policy. |
The fuel for RLHF is preference data. The process is straightforward: take a prompt, have the model generate two (or more) responses, and ask a human to pick the better one. Each judgment produces a preference pair: a 'chosen' response and a 'rejected' response for the same prompt. This is the same structure as the preference-pair visual in Chapter 22, now the central training signal.
How Preferences Are Collected
# Repeat for many prompts:
1. sample a prompt q (from real usage or a curated set)
2. generate two responses y_A, y_B from the current policy
3. show a human annotator q, y_A, y_B
4. human picks the better one → (chosen, rejected) pair
# Result: a dataset of (prompt, chosen, rejected) triples
Pairwise Comparison vs Direct Rating
Why compare two responses instead of just rating each on a scale of 1–10? Because humans are far more consistent at relative judgments than absolute ones. My '7/10' and your '7/10' may mean different things, and my own ratings drift over a long session. 'A is better than B' is far more stable, both across annotators and within a session, although some disagreement always remains. Pairwise comparison sidesteps the calibration problem of absolute scores.
| Pairwise comparison (used) | Direct rating (avoided) |
|---|---|
| 'A is better than B' | 'A scores 7/10' |
| Consistent across annotators | Annotators calibrate differently |
| Stable over a long session | Ratings drift over time |
| Easy, fast judgments | Requires an absolute scale |
| Yields preference pairs | Yields noisy scalar labels |
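The calibration argument can be made concrete with a small simulation. The sketch below is illustrative only: the quality scores and the two annotator functions are hypothetical, chosen so that both annotators rank responses the same way but use the rating scale differently.

```python
from itertools import combinations

# Hypothetical "true" quality of four responses (illustrative numbers only).
true_quality = {"A": 3, "B": 5, "C": 7, "D": 9}

def annotator_1(q):
    """Uses the scale at face value."""
    return q

def annotator_2(q):
    """A harsher annotator who compresses everything into the low end."""
    return q // 2 + 1

def ratings(annotator):
    """Absolute score this annotator gives to each response."""
    return {name: annotator(q) for name, q in true_quality.items()}

def pairwise(annotator):
    """For every pair, which response this annotator rates higher."""
    r = ratings(annotator)
    return {(a, b): (a if r[a] > r[b] else b) for a, b in combinations(r, 2)}

r1, r2 = ratings(annotator_1), ratings(annotator_2)
rating_agreement = sum(r1[k] == r2[k] for k in r1) / len(r1)

p1, p2 = pairwise(annotator_1), pairwise(annotator_2)
pair_agreement = sum(p1[k] == p2[k] for k in p1) / len(p1)

print("absolute ratings agree on", rating_agreement, "of responses")
print("pairwise choices agree on", pair_agreement, "of pairs")
```

Any two annotators whose scores are monotone in the same underlying quality will agree on every comparison, even when none of their absolute scores match. Real annotators are noisier than this, so pairwise agreement is high rather than perfect.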
The Quality and Cost of Preference Data
Preference data is expensive and its quality is paramount. Annotators must be trained, given clear guidelines (what counts as 'better'?), and monitored for agreement. Disagreement is inevitable — reasonable people prefer different responses — so the data is inherently noisy. The guidelines encode the values the model will learn: if annotators are told to prefer concise answers, the model becomes concise. The preference data IS the specification of desired behaviour.
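One way to picture the resulting dataset and the agreement monitoring described above is the following minimal sketch. The `PreferencePair` record and the `aggregate` helper are hypothetical names introduced here for illustration; majority voting is just one simple aggregation rule, not necessarily what any particular lab uses.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    """One training example for the reward model."""
    prompt: str
    chosen: str
    rejected: str

def aggregate(prompt, resp_a, resp_b, votes):
    """Turn several annotators' votes ('A' or 'B') into one preference pair.

    Returns (pair, agreement), where agreement is the fraction of annotators
    who sided with the majority. An exact tie returns (None, 0.5).
    """
    counts = Counter(votes)
    a, b = counts["A"], counts["B"]
    if a == b:
        return None, 0.5
    agreement = max(a, b) / (a + b)
    if a > b:
        return PreferencePair(prompt, resp_a, resp_b), agreement
    return PreferencePair(prompt, resp_b, resp_a), agreement

pair, agreement = aggregate(
    "How do I reset my password?",
    "Go to Settings > Security and choose Reset password.",
    "Figure it out.",
    votes=["A", "A", "B"],
)
print(pair.chosen, "| agreement:", round(agreement, 2))
```

Tracking the agreement value per example gives a simple signal for which prompts the guidelines leave ambiguous; low-agreement pairs can be dropped, down-weighted, or sent back for guideline revision.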
We cannot run reinforcement learning against a human — humans are far too slow to provide feedback on every one of the millions of responses RL generates. So we train a REWARD MODEL: a model that LEARNS to predict human preferences, then provides instant reward signals during RL. The reward model is the bridge from slow, expensive human judgments to fast, scalable training signal.
What the Reward Model Computes
A reward model takes a prompt and a response and outputs a single number — a scalar reward — representing how much a human would prefer that response. It is usually built by taking a pretrained model (often the SFT model), removing its token-prediction head, and adding a small head that outputs one number instead of a vocabulary distribution.
Arch Stack: Reward model: a Transformer with a scalar head
| Layer (output at top) | Shape / source |
|---|---|
| scalar reward r | one number |
| reward head | (d → 1) linear |
| final hidden state | (d,) |
| Transformer body | from the SFT model |
| prompt + response tokens | (T,) |
The Bradley-Terry Model and the Reward Loss
How do we train a model to output rewards when our data is only PAIRWISE preferences (A is better than B), not absolute scores? The answer is the Bradley-Terry model, a classic statistical model of pairwise comparisons. It says: the probability that response A is preferred over B is the logistic function of the DIFFERENCE in their rewards.
P(A preferred over B) = σ( r(A) - r(B) )
where σ(x) = 1/(1+e⁻ˣ) is the logistic sigmoid, and
r(·) is the scalar reward the model assigns.
Bigger reward gap → more confident preference.
This gives us a training objective. For each preference pair (chosen y_w, rejected y_l), we want the reward model to assign a HIGHER reward to the chosen response. We maximize the probability of the observed preference, which means minimizing the following loss:
L_RM = -E[ log σ( r(y_w) - r(y_l) ) ]
y_w = chosen (preferred) response
y_l = rejected response
# Pushes r(chosen) up and r(rejected) down, until their gap
# matches the strength of the human preference.
import torch
import torch.nn.functional as F

class RewardModel(torch.nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.body = base_model                   # Transformer from the SFT model
        self.head = torch.nn.Linear(d_model, 1)  # outputs ONE number

    def forward(self, input_ids):
        h = self.body(input_ids).last_hidden_state  # (B, T, d)
        # Reward = scalar from the LAST token's hidden state
        # (this indexing assumes sequences are not right-padded).
        return self.head(h[:, -1]).squeeze(-1)      # (B,)

def reward_loss(rm, chosen_ids, rejected_ids):
    """Bradley-Terry loss: chosen should score higher than rejected."""
    r_chosen = rm(chosen_ids)       # (B,)
    r_rejected = rm(rejected_ids)   # (B,)
    # -log sigma(r_w - r_l): minimized when r_chosen >> r_rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Train this like any classifier. After training, the reward model
# scores ANY (prompt, response) -- the instant feedback RL needs.
Now we have a reward model that scores responses. The RL stage uses that reward to improve the policy. But many readers have never studied reinforcement learning, so this section builds up just enough RL from scratch to understand RLHF. We will keep it concrete and tied to language models throughout.
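Before building up the RL machinery, it may help to see the Bradley-Terry loss from the previous section on concrete numbers. The sketch below uses plain Python rather than PyTorch so it runs anywhere; the reward values are arbitrary illustrations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_prob(r_a, r_b):
    """Bradley-Terry: P(A preferred over B) = sigmoid(r(A) - r(B))."""
    return sigmoid(r_a - r_b)

def pair_loss(r_chosen, r_rejected):
    """Per-pair reward-model loss: -log sigmoid(r_w - r_l)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# A growing reward gap means a more confident preference and a smaller loss.
for gap in (0.0, 1.0, 3.0):
    p = preference_prob(gap, 0.0)
    print(f"gap={gap:.1f}  P(chosen preferred)={p:.3f}  loss={pair_loss(gap, 0.0):.3f}")

# Ranking the pair the wrong way round is penalized heavily.
print(f"wrong order, gap=-3.0  loss={pair_loss(0.0, 3.0):.3f}")
```

Note that only the DIFFERENCE in rewards enters the loss, so adding the same constant to every reward changes nothing. The reward model's absolute scale is therefore arbitrary, which is one reason rewards are often normalized before the RL stage.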
Reframing Generation as a Sequence of Decisions
In RL terms, generating a response is a sequence of DECISIONS. At each step, the model is in a STATE (the prompt plus the tokens generated so far), takes an ACTION (choosing the next token), and eventually receives a REWARD (the reward model's score of the finished response). The model's strategy for choosing actions is called its POLICY — which is exactly the language model's probability distribution over next tokens.
| RL term | In language modeling |
|---|---|
| Policy π | The language model itself — its distribution over next tokens |
| State | The prompt plus the tokens generated so far |
| Action | Choosing the next token |
| Trajectory | A full generated response (sequence of actions) |
| Reward | The reward model's score of the completed response |
| Return | The total reward for the trajectory (here, just the final reward) |
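The mapping in the table can be traced through a toy rollout. This is a sketch under heavy simplification: `toy_policy` and `toy_reward` are hypothetical stand-ins for a language model and a reward model, and the four-word vocabulary is invented for illustration.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]

def toy_policy(state):
    """POLICY: a distribution over next tokens given the state.
    This stand-in only looks at how many tokens have been generated,
    making <eos> more likely as the response grows."""
    p_eos = min(0.9, 0.2 * len(state["generated"]))
    p_other = (1.0 - p_eos) / 3
    return [p_other, p_other, p_other, p_eos]

def toy_reward(prompt, response):
    """REWARD: scores only the finished response (stand-in for a reward model)."""
    return float(response.count("yes")) - 0.1 * len(response)

def rollout(prompt, max_tokens=8, seed=0):
    rng = random.Random(seed)
    state = {"prompt": prompt, "generated": []}          # STATE
    trajectory = []                                      # TRAJECTORY of (state, action)
    for _ in range(max_tokens):
        probs = toy_policy(state)
        action = rng.choices(VOCAB, weights=probs)[0]    # ACTION: pick next token
        trajectory.append((tuple(state["generated"]), action))
        if action == "<eos>":
            break
        state["generated"].append(action)
    reward = toy_reward(prompt, state["generated"])      # one reward, at the end
    return trajectory, state["generated"], reward

trajectory, response, reward = rollout("Is it raining?")
for generated_so_far, action in trajectory:
    print(generated_so_far, "->", action)
print("return =", reward)
```

Every intermediate step receives no reward of its own; the single score at the end is the return for the whole trajectory, which is exactly the credit-assignment problem that advantages and value models address later in the chapter.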
The Goal: Maximize Expected Reward
The objective of RL is simple to state: adjust the policy so that the responses it generates get high reward, ON AVERAGE. Formally, we want to maximize the expected reward over the responses the policy produces:
maximize J(θ) = E[ r(y) ] over responses y sampled from policy πθ
θ = the policy's parameters
r(y) = the reward model's score of response y
# Make high-reward responses more likely, low-reward ones less likely.
The Policy Gradient: REINFORCE
How do we increase expected reward by gradient ascent? The policy gradient theorem gives a beautifully intuitive answer: to make high-reward responses more likely, increase the probability of the actions that led to them, weighted by how much reward they earned. The simplest version is the REINFORCE algorithm:
∇θ J = E[ r(y) · ∇θ log πθ(y) ]
In words: nudge the policy to make response y MORE likely
in proportion to its reward r(y).
high reward → push its probability UP
low reward → push its probability DOWN
The Problem with Plain REINFORCE: Variance
Plain REINFORCE works but is extremely noisy. The reward r(y) varies wildly from sample to sample, so the gradient estimate jumps around, making training slow and unstable. The standard fix is a BASELINE: instead of weighting by the raw reward, weight by how much BETTER than average a response was. This 'advantage' — reward minus a baseline — has much lower variance and is the key idea connecting REINFORCE to PPO.
A(y) = r(y) - b (advantage = reward minus a baseline b)
∇θ J = E[ A(y) · ∇θ log πθ(y) ]
# A > 0: better than average → increase probability
# A < 0: worse than average → decrease probability
# Lower variance than using raw reward → more stable training.
REINFORCE with a baseline is the conceptual foundation, but the algorithm actually used in classic RLHF is PPO (Proximal Policy Optimization; Schulman et al., 2017). PPO adds two crucial ingredients that make policy-gradient RL stable enough to train language models: a learned value model for the baseline, and a CLIPPED objective that prevents the policy from changing too fast.
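Before turning to PPO's two ingredients, the claim that a baseline lowers variance without changing the expected gradient can be checked exactly on a tiny example. The sketch below assumes a softmax policy over just three possible responses, with made-up probabilities and rewards, and computes the mean and variance of the single-sample REINFORCE estimator by enumeration instead of sampling.

```python
def grad_log_prob(k, probs):
    """Gradient of log pi(k) w.r.t. the softmax logits: onehot(k) - probs."""
    return [(1.0 if j == k else 0.0) - p for j, p in enumerate(probs)]

def gradient_stats(probs, rewards, baseline):
    """Exact mean and total variance of the estimator (r - b) * grad log pi(y),
    where y is drawn from the policy."""
    n = len(probs)
    samples = [
        [(rewards[k] - baseline) * g for g in grad_log_prob(k, probs)]
        for k in range(n)
    ]
    mean = [sum(probs[k] * samples[k][j] for k in range(n)) for j in range(n)]
    var = sum(
        probs[k] * sum((samples[k][j] - mean[j]) ** 2 for j in range(n))
        for k in range(n)
    )
    return mean, var

probs = [0.5, 0.3, 0.2]        # hypothetical policy over three responses
rewards = [10.0, 11.0, 12.0]   # hypothetical rewards: all large, close together
mean_reward = sum(p * r for p, r in zip(probs, rewards))

mean_raw, var_raw = gradient_stats(probs, rewards, baseline=0.0)
mean_base, var_base = gradient_stats(probs, rewards, baseline=mean_reward)

print("expected gradient (raw)     :", [round(g, 4) for g in mean_raw])
print("expected gradient (baseline):", [round(g, 4) for g in mean_base])
print("variance raw vs baseline    :", round(var_raw, 3), "vs", round(var_base, 3))
```

The two expected gradients coincide, because the baseline term multiplies a quantity whose expectation is zero. The variance falls sharply because the estimator no longer carries the large common offset shared by all three rewards.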
Ingredient 1: The Value Model
PPO learns the baseline with a VALUE MODEL — a network (often sharing the body of the policy or reward model) that predicts the expected reward from a given state. The advantage is then the actual reward minus the value model's prediction: how much better the outcome was than expected. The value model is trained alongside the policy to predict rewards accurately.
Ingredient 2: The Clipped Objective
The danger in policy-gradient RL is taking too large a step — a big update can collapse the policy into producing garbage, from which it never recovers. PPO prevents this by CLIPPING: it limits how much the policy's probability for an action can change in a single update. If an update would change a token's probability by more than a small factor, the change is clipped.
ratio ρ = πθ(a|s) / π_old(a|s) # how much the policy changed
L_PPO = E[ min( ρ · A, clip(ρ, 1-ε, 1+ε) · A ) ]
ε ≈ 0.2 is the clip range.
# The clip caps the incentive: even if A is large, the objective gains
# nothing from moving ρ beyond 1±ε, keeping each step small and safe.
The min-of-two-terms looks cryptic but is doing something simple: it takes the more pessimistic (smaller) of the unclipped and clipped objectives, which removes the incentive to push the policy ratio far beyond 1±ε. The effect is that PPO improves the policy in small, safe steps; 'proximal' means it stays close to the previous policy each update.
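The five cases below make the min-of-two-terms concrete. This is a plain-Python sketch of the per-sample term with ε = 0.2; the ratio and advantage values are arbitrary illustrations.

```python
def ppo_term(ratio, advantage, eps=0.2):
    """One sample's clipped surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [
    (1.1, +1.0, "inside the clip range: ordinary policy-gradient term"),
    (1.5, +1.0, "good action already boosted past 1+eps: capped"),
    (0.5, +1.0, "good action whose probability fell: NOT capped"),
    (0.5, -1.0, "bad action already suppressed past 1-eps: capped"),
    (1.5, -1.0, "bad action whose probability rose: NOT capped"),
]
for ratio, adv, note in cases:
    print(f"rho={ratio:.1f}  A={adv:+.1f}  term={ppo_term(ratio, adv):+.2f}  ({note})")
```

The cap applies only when the policy has already moved in the direction the advantage favours. When it has moved the wrong way, the unclipped term is the smaller one and the gradient is left intact, so the update can still correct the mistake.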
# Given: policy, value model, frozen reward model, frozen reference
1. sample a batch of prompts
2. generate responses from the current policy
3. score responses with the reward model → rewards
4. compute advantages A using the value model as baseline
5. for several mini-epochs over this batch:
compute the clipped PPO objective
add the KL penalty (Section 23.7)
update the policy AND the value model
# Repeat for many iterations
There is a serious danger lurking in RLHF, and the KL penalty is the defense against it. Recall from Section 23.4 that the reward model is an imperfect proxy for human preferences. If we optimize against it with no constraint, the policy will discover responses that score HIGH reward but are actually BAD: it exploits the reward model's flaws. This is called REWARD HACKING, and it is the central failure mode of RLHF.
Reward Hacking: Optimizing the Proxy, Not the Goal
Imagine the reward model has a quirk: it slightly over-rewards responses that include the word 'certainly' or that are very long. With unconstrained optimization, the policy will discover this and produce responses stuffed with 'certainly' or padded to absurd length — high reward, terrible quality. The policy is optimizing the PROXY (reward model) instead of the true GOAL (human preference). This is Goodhart's law again (Chapter 21): when the reward becomes the target, it stops measuring quality.
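The 'certainly' quirk can be reproduced in a few lines. Everything in this sketch is hypothetical: the candidate responses, their `true_quality` scores, and the `flawed_reward` function, which tracks true quality but adds small bonuses for the word 'certainly' and for length.

```python
candidates = [
    {"text": "Restart the router, then check the cable.", "true_quality": 0.9},
    {"text": "Certainly! " * 20 + "Restart it.", "true_quality": 0.3},
    {"text": "No idea.", "true_quality": 0.1},
]

def flawed_reward(c):
    """Hypothetical proxy reward: true quality plus two exploitable bonuses."""
    return (
        c["true_quality"]
        + 0.05 * c["text"].lower().count("certainly")
        + 0.001 * len(c["text"])
    )

best_by_proxy = max(candidates, key=flawed_reward)
best_by_truth = max(candidates, key=lambda c: c["true_quality"])

print("proxy picks :", best_by_proxy["text"][:40], "...")
print("humans pick :", best_by_truth["text"])
```

Each bonus is small for an ordinary response, so the flaw is nearly invisible on the data the reward model was trained on. It only dominates once an optimizer is free to search for inputs that stack the bonuses, which is what unconstrained RL does.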
The Fix: Penalize Drift From the Reference Model
The KL penalty constrains the policy to stay CLOSE to the original SFT model (the frozen reference). It adds a penalty proportional to the KL divergence (Chapter 4) between the current policy and the reference — a measure of how much the policy's distribution has drifted. The policy is rewarded for high reward-model score, but PENALIZED for straying far from the sensible SFT model.
maximize E[ r(y) ] - β · KL( πθ ∥ π_ref )
r(y) = reward-model score (pulls toward high reward)
β · KL = penalty for drifting from the SFT reference (pulls toward sanity)
β = the KL coefficient, tuned to balance the two forces.
In practice the KL term is folded into the per-token reward: each token gets the reward-model score (at the end) minus a per-token penalty for the policy assigning that token a much higher probability than the reference does. This keeps the policy anchored to fluent, sensible language while still letting it improve toward higher reward.
import torch

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine the reward-model score with a per-token KL penalty."""
    # Per-token KL estimate: the log-ratio log πθ - log π_ref of each sampled
    # token, i.e. how much more likely the policy makes it than the reference.
    # Large gap = large drift = large penalty.
    per_token_kl = policy_logprobs - ref_logprobs   # (B, T)
    # The reward-model score is given only at the final token;
    # the KL penalty applies at every token.
    shaped = -beta * per_token_kl                   # penalty everywhere
    shaped[:, -1] += reward                         # + RM score at the end
    return shaped

# beta controls the leash:
#   beta too LOW  -> policy drifts, reward-hacks, produces garbage
#   beta too HIGH -> policy barely changes from SFT, no improvement
# Tuning beta is one of the trickiest parts of RLHF.
We can now assemble the complete RLHF loop. This is where the four models from Section 23.2 (policy, reference, reward, value) all come together. Seeing them interact in one place makes the whole pipeline concrete.
Arch Stack: The four models of PPO-based RLHF
| Model (status) | Job |
|---|---|
| Policy (training) | generates responses, gets updated |
| Value model (training) | predicts baseline for advantages |
| Reward model (frozen) | scores responses |
| Reference model (frozen) | anchors the KL penalty |
import torch
import torch.nn.functional as F

# NOTE: load_*, sample_prompts, compute_gae and the .log_probs / .generate
# methods are placeholder helpers for this sketch, not library APIs.
# Four models: policy (train), reference & reward (frozen), value (train)
policy   = load_sft_model()               # the model we improve
ref      = load_sft_model().eval()        # frozen copy for KL
reward_m = load_reward_model().eval()     # frozen, from Section 23.4
value_m  = load_value_model()             # trained for the baseline
opt = torch.optim.AdamW(list(policy.parameters()) + list(value_m.parameters()), lr=1e-6)

for iteration in range(n_iterations):
    prompts = sample_prompts(batch_size)
    # 1. GENERATE responses from the current policy
    responses = policy.generate(prompts)
    # 2. SCORE with the reward model + per-token KL vs reference.
    #    Everything here is a fixed target for this batch, so no gradients.
    with torch.no_grad():
        rewards    = reward_m(prompts, responses)
        ref_logp   = ref.log_probs(prompts, responses)
        pol_logp   = policy.log_probs(prompts, responses)   # log π_old
        shaped_rew = kl_penalized_reward(rewards, pol_logp, ref_logp)
        # 3. ADVANTAGES from the value model (the baseline)
        values     = value_m(prompts, responses)
        advantages = compute_gae(shaped_rew, values)  # generalized advantage estimation
        returns    = advantages + values              # regression target for the value model
    # 4. PPO UPDATE: several mini-epochs over this batch
    for _ in range(ppo_epochs):
        new_logp = policy.log_probs(prompts, responses)
        ratio    = torch.exp(new_logp - pol_logp)     # π/π_old
        clipped  = torch.clamp(ratio, 0.8, 1.2)       # 1±ε
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        # The value model regresses onto returns, not onto per-token rewards.
        value_loss  = F.mse_loss(value_m(prompts, responses), returns)
        (policy_loss + 0.5 * value_loss).backward()
        opt.step(); opt.zero_grad()

# Note the TINY learning rate (1e-6): RL updates are delicate.
# Generation happens INSIDE the loop -- a big reason RLHF is slow.
RLHF is powerful but notoriously difficult. It combines the instability of reinforcement learning with the imperfection of a learned reward model and the complexity of juggling four models. Understanding the failure modes is essential: they explain why so much of the craft of alignment is about taming RLHF.
| Failure mode | What happens | Defense |
|---|---|---|
| Reward hacking | Policy exploits reward-model flaws | KL penalty, better reward model |
| Over-optimization | Reward rises but true quality falls | Early stopping, monitor real quality |
| Mode collapse | Policy converges to repetitive outputs | KL penalty, entropy bonus |
| Reward-model drift | Policy moves outside RM's reliable range | Retrain RM on fresh policy outputs |
| Instability / divergence | Training collapses suddenly | Tiny lr, clipping, careful tuning |
| Sycophancy | Model tells humans what they want to hear | Careful annotation guidelines |
| Calibration loss | Model becomes overconfident | (Ch. 21) hard to fully fix |
Over-Optimization: The Goodhart Curve
A characteristic RLHF phenomenon: as you optimize against the reward model, the reward-model SCORE keeps rising, but the TRUE quality (measured by held-out humans) rises, peaks, and then FALLS. Beyond the peak, the policy is exploiting the reward model rather than improving — over-optimization. The art is to stop near the peak, before the proxy and the goal diverge. This 'Goodhart curve' is one of the most important empirical findings in RLHF (Gao et al., 2022).
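Stopping near the peak amounts to early stopping on a held-out human evaluation instead of on the reward-model score. The sketch below assumes one human-quality measurement per saved checkpoint; the numbers are invented for illustration and are not results from any experiment, and `pick_checkpoint` is a hypothetical helper.

```python
# Hypothetical per-checkpoint measurements (illustrative numbers only).
proxy_reward  = [0.10, 0.40, 0.65, 0.85, 1.00, 1.10]  # reward-model score: keeps rising
human_quality = [0.50, 0.62, 0.70, 0.71, 0.66, 0.58]  # held-out human eval: rises, then falls

def pick_checkpoint(quality, patience=1):
    """Return the index of the best checkpoint seen, stopping once quality has
    failed to improve for more than `patience` consecutive evaluations."""
    best_idx, bad_evals = 0, 0
    for i in range(1, len(quality)):
        if quality[i] > quality[best_idx]:
            best_idx, bad_evals = i, 0
        else:
            bad_evals += 1
            if bad_evals > patience:
                break
    return best_idx

best = pick_checkpoint(human_quality)
print("stop at checkpoint", best)
print("proxy reward there:", proxy_reward[best], "| proxy reward at the end:", proxy_reward[-1])
```

Selecting on `proxy_reward` instead would always choose the last checkpoint, which is exactly the over-optimized one. The cost of doing this properly is that each checkpoint needs a fresh human evaluation.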
GRPO (Group Relative Policy Optimization), introduced by DeepSeek (Shao et al., 2024) and central to the DeepSeek-R1 reasoning models, is a streamlined alternative to PPO that has become very popular — especially for training reasoning models. Its key idea is elegant: eliminate the value model entirely by computing the baseline from a GROUP of sampled responses to the same prompt.
The Problem GRPO Solves
Recall that PPO needs a value model to estimate the baseline for advantages (Section 23.6). This value model can be as large as the policy itself, so it adds substantial memory and training complexity, and it must be learned accurately or the advantages are wrong. GRPO asks: what if we could get a good baseline WITHOUT a separate model? Its answer: sample several responses per prompt, and use their AVERAGE reward as the baseline.
The Group-Relative Advantage
For each prompt, GRPO samples a GROUP of G responses (say, 8 or 16). It scores all of them with the reward model, then computes each response's advantage as its reward relative to the group — normalized by subtracting the group mean and dividing by the group standard deviation. A response that beats its group-mates gets positive advantage; one that loses gets negative. No value model needed.
For prompt q, sample a group of G responses {o₁, ..., o_G}.
Score each with the reward model: {r₁, ..., r_G}.
Aᵢ = (rᵢ - mean(r₁..r_G)) / std(r₁..r_G)
# The group's mean reward IS the baseline. Above average → A>0,
# below average → A<0. Every token in oᵢ shares the same advantage Aᵢ.
The GRPO Objective
GRPO then uses the same clipped PPO-style objective from Section 23.6, but with the group-relative advantage and a KL penalty to the reference model included directly. Because the advantage is the same for all tokens in a response, the algorithm is simpler than PPO's per-token value estimation.
L_GRPO = E[ min( ρ · Aᵢ, clip(ρ, 1-ε, 1+ε) · Aᵢ ) ] - β · KL(πθ ∥ π_ref)
ρ = πθ / π_old (the policy ratio, as in PPO)
Aᵢ = the group-relative advantage of response i
# Same clipped objective as PPO, but NO value model, and the
# advantage comes from the group, not a learned baseline.
# Three models: policy (train), reward (frozen), reference (frozen)
# NO value model!
1. sample a batch of prompts
2. for each prompt, generate a GROUP of G responses
3. score every response with the reward model
4. advantage A_i = (r_i - group_mean) / group_std
5. clipped policy update using A_i, plus KL penalty to reference
# Simpler: one fewer model, no value-function learning
import torch

def grpo_advantages(rewards):
    """rewards: (num_prompts, group_size) reward-model scores.
    Returns group-normalized advantages of the same shape."""
    # Baseline = the per-prompt group MEAN (no value model needed)
    mean = rewards.mean(dim=1, keepdim=True)         # (num_prompts, 1)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    # Group-relative advantage: above the group mean -> positive
    return (rewards - mean) / std                    # (num_prompts, group_size)

# Example: one prompt, a group of 4 responses
rewards = torch.tensor([[0.8, 0.2, 0.9, 0.1]])
adv = grpo_advantages(rewards)
print(adv)  # the 0.9 and 0.8 responses get positive advantage,
            # the 0.2 and 0.1 responses get negative advantage.
# Every TOKEN in a response is assigned that response's advantage,
# then used in the clipped objective. No per-token value model.
| PPO | GRPO |
|---|---|
| Four models (incl. value) | Three models (no value) |
| Learned value-model baseline | Group-mean baseline |
| Per-token advantages (GAE) | One advantage per response |
| More memory, more tuning | Less memory, simpler |
| Classic RLHF (InstructGPT) | DeepSeek-R1, reasoning models |
| Generates 1 response/prompt | Generates a GROUP/prompt |
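To connect the group-relative advantage to the objective, here is a plain-Python sketch of the GRPO objective for a single response. Two things are assumptions not fixed by the text above: the per-token KL is estimated as x − log x − 1 with x = π_ref/πθ (the non-negative estimator described in the DeepSeekMath paper), and the default β = 0.04 is only an illustrative value. The function name `grpo_response_objective` is hypothetical.

```python
import math

def grpo_response_objective(new_logp, old_logp, ref_logp, advantage, eps=0.2, beta=0.04):
    """Objective (to MAXIMIZE) for one response, averaged over its tokens.

    new_logp / old_logp / ref_logp: per-token log-probs of the sampled tokens
    under the current policy, the policy that generated them, and the reference.
    advantage: the single group-relative advantage shared by every token.
    """
    total = 0.0
    for lp_new, lp_old, lp_ref in zip(new_logp, old_logp, ref_logp):
        ratio = math.exp(lp_new - lp_old)                     # rho = pi_theta / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        surrogate = min(ratio * advantage, clipped * advantage)
        # Assumed KL estimator: x - log(x) - 1 with x = pi_ref / pi_theta (always >= 0).
        log_x = lp_ref - lp_new
        kl = math.exp(log_x) - log_x - 1.0
        total += surrogate - beta * kl
    return total / len(new_logp)

# No policy change and no drift from the reference: the objective equals A.
same = [-1.0, -2.0]
print(grpo_response_objective(same, same, same, advantage=1.5))

# Same policy ratio, but the policy has drifted from the reference: KL lowers it.
print(grpo_response_objective([-1.0], [-1.0], [-2.0], advantage=1.0))
```

Compared with the PPO loop, the only per-token quantities left are the probability ratio and the KL term; the advantage is a constant for the whole response, which is what removes the need for a value model.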
Beyond the core algorithms, getting RLHF to work in practice involves a set of variants and practical lessons. This section collects the most important ones.
The Family of Methods
| Method | Key idea | Trade-off |
|---|---|---|
| PPO-RLHF | Reward model + clipped policy RL | Powerful, complex, unstable |
| GRPO | Group-relative advantage, no value model | Simpler, great for reasoning |
| RLAIF | AI feedback replaces human labels | Scalable, needs a good judge |
| DPO | Skip RL; classify preferences directly | Simple, stable (Chapter 24) |
| Rejection sampling | Keep best-of-N by reward, then SFT | Very simple, a strong baseline |
| RLVR | RL with verifiable rewards | For math/code (Chapter 25) |
Rejection Sampling: The Simple Baseline
Before reaching for full PPO, it is worth knowing the simplest preference-based method: rejection sampling (or best-of-N). Generate N responses per prompt, score them with the reward model, keep the best one, and add it to an SFT dataset. Then fine-tune on these best responses. This 'distills' the reward model's preferences into the policy via plain SFT — no RL machinery at all. It is surprisingly effective and is often used as a first step or a strong baseline.
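The whole method fits in a few lines. In this sketch `toy_generate` and `toy_reward` are hypothetical stand-ins for sampling from the policy and scoring with a trained reward model; the canned responses and the word-count reward are invented so the example runs on its own.

```python
import itertools

CANNED = ["Try again.", "Restart the app, then sign in again.", "Restart."]
_cycle = itertools.cycle(CANNED)

def toy_generate(prompt):
    """Stand-in for sampling from the policy: cycles through canned responses."""
    return next(_cycle)

def toy_reward(prompt, response):
    """Stand-in for a reward model: here, simply favours more detailed answers."""
    return len(response.split())

def best_of_n(prompt, generate, reward, n=4):
    """Sample n responses and keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))

def build_sft_dataset(prompts, generate, reward, n=4):
    """One (prompt, best response) pair per prompt, ready for ordinary SFT."""
    return [(q, best_of_n(q, generate, reward, n)) for q in prompts]

prompts = ["The app keeps logging me out.", "I cannot sign in."]
dataset = build_sft_dataset(prompts, toy_generate, toy_reward, n=3)
for q, y in dataset:
    print(q, "->", y)
```

Because the reward model only selects among samples and never supplies gradients, the policy cannot move further from its starting point than its own best samples allow. That limits how far it can improve in one round, and also how badly it can exploit reward-model flaws.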
RLHF's Place in the Modern Recipe
Today, RLHF (or a preference-optimization stand-in like DPO) is a standard stage in building frontier assistants. The typical modern recipe is: pretrain (Part IV), SFT (Chapter 22), then preference optimization — increasingly DPO or GRPO rather than classic PPO — then safety tuning (Chapter 26). RLHF was the breakthrough that made models genuinely helpful and aligned; its descendants continue to do that work with less fragility.
RLHF Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Why RLHF | Comparing is easier than writing | Breaks the SFT imitation ceiling |
| Preference data | (prompt, chosen, rejected) triples | Guidelines = model values |
| Reward model | Bradley-Terry: σ(r_w - r_l) | An imperfect proxy |
| Policy gradient | Do more of what earned reward | REINFORCE + baseline |
| PPO | Clipped objective + value model | Powerful but heavy (4 models) |
| KL penalty | Stay close to the SFT reference | Prevents reward hacking |
| Over-optimization | Reward up, true quality down | Distrust the proxy; stop early |
| GRPO | Group-mean baseline, no value model | Simpler; great for reasoning |
Exercises
Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.
Further reading: “Training language models to follow instructions with human feedback” (Ouyang et al., 2022, InstructGPT) — the canonical RLHF paper. “Deep Reinforcement Learning from Human Preferences” (Christiano et al., 2017) for the original idea. “Proximal Policy Optimization Algorithms” (Schulman et al., 2017) for PPO. “DeepSeekMath” (Shao et al., 2024) and the DeepSeek-R1 report for GRPO. “Scaling Laws for Reward Model Overoptimization” (Gao et al., 2022) for the Goodhart curve. “Constitutional AI” (Bai et al., 2022) for RLAIF. The Hugging Face TRL library for PPO, GRPO, and DPO implementations.
Next → Chapter 24: Direct Preference Optimization
RLHF works, but you have now seen how hard it is: a reward model, a value model, a delicate RL loop, reward hacking, and a tug-of-war over the KL coefficient. Chapter 24 introduces a remarkable simplification — Direct Preference Optimization (DPO) — which achieves the goal of preference alignment WITHOUT a separate reward model and WITHOUT reinforcement learning. DPO shows, through an elegant derivation, that the RLHF objective can be reframed as a simple classification loss on preference pairs, trainable with ordinary supervised learning. We will derive it from the RLHF objective of this chapter, meet its variants (IPO, KTO, ORPO), and understand when to choose DPO over the RL methods you just learned.