RLHF
Detailed solutions for the exercises in Chapter 23. Try solving them yourself before checking the answers.
Solution
SFT's limits: (1) it can only imitate the demonstrations it's given (capped by demonstrator quality), and (2) it has no signal about what NOT to do — only positive examples. Preferences are richer because comparing two responses conveys relative quality (including what's worse), which is information demonstrations lack. Example: judging which of two poems is better, or which summary is more faithful, is far easier than writing the best one from scratch — evaluation is easier than generation.
Solution
(1) Policy — the model being trained to generate preferred responses (TRAINED). (2) Reference model — a frozen copy of the initial policy, used for the KL penalty (FROZEN). (3) Reward model — scores responses, trained beforehand on preferences (FROZEN during PPO). (4) Value/critic model — estimates expected return for the advantage baseline (TRAINED). So policy and critic are trained; reference and reward are frozen.
Solution
(1) Consistency: humans give noisy, drifting absolute scores (one annotator's '7' ≠ another's '7'), but are far more reliable at saying 'A is better than B'. (2) Calibration-free: comparisons need no shared numeric scale, avoiding the need to anchor and normalize ratings across annotators and time. Relative judgments are easier and more consistent than absolute ones, yielding cleaner training signal.
Solution
The Bradley-Terry model says the probability that the winner w is preferred over the loser l is σ(r_w − r_l). Maximizing the likelihood of the observed preferences means maximizing Σ log σ(r_w − r_l), i.e. minimizing the negative log-likelihood, whose per-pair term is −log σ(r_w − r_l). Minimizing it pushes r_w above r_l until the sigmoid saturates and the gradient fades, which trains the reward model to score preferred responses higher.
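The per-pair loss can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function name `bt_loss` and the example scores are assumptions made for the sketch.

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Per-pair Bradley-Terry loss: -log(sigmoid(r_w - r_l))."""
    margin = r_w - r_l
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward-model scores for (chosen, rejected) responses.
for r_w, r_l in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (6.0, 0.0)]:
    print(f"r_w={r_w:+.1f} r_l={r_l:+.1f} loss={bt_loss(r_w, r_l):.4f}")
```

When the two scores are tied the loss is log 2; it shrinks toward zero as the margin r_w − r_l grows, and it grows roughly linearly when the ordering is wrong.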
Solution
State = the prompt plus tokens generated so far; action = the next token chosen; trajectory = the full generated response; reward = the reward model's score (usually given at the end of the sequence, with a per-token KL penalty). The POLICY is the language model itself — it maps the current context (state) to a distribution over next tokens (actions). Generating text is 'acting' in this RL formulation.
Solution
The REINFORCE gradient is E[ ∇θ log πθ(a|s) · R ]. The factor ∇θ log πθ(a|s) points in the direction that increases the probability of the taken action; multiplying by the reward R scales that step by how good the outcome was. So actions from high-reward trajectories are made MORE likely, in proportion to R, while actions from low-reward trajectories get a smaller push (or are made less likely when R is negative): 'do more of what worked'. In effect it is the log-likelihood gradient of the sampled actions, weighted by outcome.
Solution
Rewards vary wildly across trajectories, so the gradient estimate is noisy (high variance) — even good actions in a low-reward trajectory get pushed down. Subtracting a baseline b (e.g. the average/expected reward) gives the advantage A = R − b, which has lower variance without changing the expected gradient (the baseline is action-independent). A>0 means the trajectory was better than expected (increase its actions' probability); A<0 means worse than expected (decrease it). The baseline centers the signal.
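The claim that an action-independent baseline leaves the expected gradient unchanged follows from one line of algebra. It is written here for a discrete action space, an assumption made for readability; the same argument goes through with an integral.

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b \right]
  = b \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
  = b \sum_a \nabla_\theta \pi_\theta(a \mid s)
  = b\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b\, \nabla_\theta 1
  = 0 .
```

Hence E[∇θ log πθ(a|s) · (R − b)] = E[∇θ log πθ(a|s) · R]: the baseline changes the variance of the estimate, not its mean. The baseline b may depend on the state s, but not on the action a.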
Solution
PPO maximizes min(ρ·A, clip(ρ, 1−ε, 1+ε)·A), where ρ = π_new/π_old is the probability ratio. The clip removes the incentive to move the policy far from π_old in a single update (ρ far from 1), which would be unstable because the advantages were estimated from samples drawn under π_old and are only trustworthy near it. Taking the MIN makes the objective pessimistic: it caps the benefit of a large favorable move but does NOT cap the penalty of a large unfavorable one. The objective is therefore a lower bound on the unclipped surrogate ρ·A, which gives stable improvement without runaway steps.
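Splitting the clipped objective by the sign of the advantage makes the asymmetry explicit. The symbol L for the per-sample objective is introduced here only for this derivation.

```latex
L(\rho, A)
  = \min\bigl(\rho A,\ \operatorname{clip}(\rho,\, 1-\epsilon,\, 1+\epsilon)\, A\bigr)
  =
  \begin{cases}
    \min(\rho,\ 1+\epsilon)\, A, & A > 0, \\
    \max(\rho,\ 1-\epsilon)\, A, & A < 0 .
  \end{cases}
```

For A > 0 the objective is flat once ρ > 1+ε, so there is nothing more to gain by raising the action's probability further. For A < 0 it is flat once ρ < 1−ε. In the opposite directions (ρ small with A > 0, ρ large with A < 0) the objective stays equal to ρ·A, so the penalty is never clipped away.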
Solution
Reward hacking is when the policy maximizes the reward model's score without genuinely improving — e.g. learning that the RM rewards long, confident, or flattering answers, so it pads responses regardless of quality. The KL penalty keeps the policy close to the reference model, preventing it from drifting into degenerate, RM-exploiting regions far from sensible language. If β is too low, the policy drifts and hacks the reward; if too high, it barely changes from the reference and learns little. β balances improvement against staying grounded.
Solution
As you optimize against the reward model, TRUE quality first rises then falls even as the RM SCORE keeps climbing — the policy increasingly exploits the RM's imperfections (Goodhart: the proxy diverges from the goal). The curve of true quality vs RM score is hump-shaped. You must stop near the peak of true quality, which occurs while the RM score is still increasing — chasing the RM score past that point degrades the actual model.
Solution
GRPO removes the separate VALUE/critic model. Instead of a learned baseline, it samples a GROUP of responses per prompt and uses the group's mean reward as the baseline, computing each response's advantage relative to its peers (group-normalized). This is simpler and more memory-efficient (no critic), and it suits reasoning because verifiable rewards (correct/incorrect) over a group give a clean relative signal — above-average solutions get positive advantage — without needing to train a value function on sparse end-rewards.
Solution
Attach a scalar head to a Transformer's final representation and train with −logσ(r_w−r_l) (Exercise 4) on preference pairs. The model learns to output a scalar score where chosen responses score higher than rejected — the reward signal for RLHF.
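As a dependency-free illustration of the training signal, the sketch below swaps the Transformer for a fixed feature vector with a linear scalar head, then fits it with the same pairwise loss by plain SGD. The linear model, the synthetic "true" reward, and all names (`train_reward_model`, `true_w`, the learning rate) are assumptions made for the example, not the chapter's implementation.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    """Scalar reward: a linear head on a fixed feature vector (stand-in for a Transformer)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(pairs, dim, lr=0.1, epochs=50):
    """SGD on the pairwise loss -log(sigmoid(r_chosen - r_rejected))."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = score(w, chosen) - score(w, rejected)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected),
            # so the descent step adds that quantity scaled by lr.
            coeff = 1.0 - sigmoid(margin)
            for i in range(dim):
                w[i] += lr * coeff * (chosen[i] - rejected[i])
    return w

# Synthetic preferences: a hidden "true" reward decides which response is chosen.
rng = random.Random(0)
true_w = [1.0, -2.0]
pairs = []
for _ in range(200):
    a = [rng.uniform(-1, 1) for _ in true_w]
    b = [rng.uniform(-1, 1) for _ in true_w]
    pairs.append((a, b) if score(true_w, a) > score(true_w, b) else (b, a))

w = train_reward_model(pairs, dim=len(true_w))
print("learned weights:", [round(x, 2) for x in w])
```

Only the direction of the learned weights matters for ranking; their scale keeps growing slowly on noise-free data because the loss rewards ever larger margins.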
Solution
Measuring how often the trained RM scores the held-out chosen response above the rejected one gives pairwise accuracy (typically 65–75% on noisy human preferences). This checks that the RM generalizes the preference signal rather than memorizing the training pairs, a prerequisite for useful RLHF.
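The metric itself is a one-liner once the scores are available. A minimal sketch, assuming the reward model's scores for the held-out pairs have already been collected into two parallel lists (the function name and the numbers are illustrative):

```python
def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of held-out pairs where the chosen response outscores the rejected one."""
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    # Ties count as misses: the RM must rank the chosen response strictly higher.
    wins = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return wins / len(chosen_scores)

# Hypothetical reward-model scores on four held-out preference pairs.
chosen = [1.2, 0.3, -0.5, 2.0]
rejected = [0.4, 0.9, -1.1, 1.5]
print(pairwise_accuracy(chosen, rejected))  # 3 of 4 pairs ranked correctly -> 0.75
```

Random scoring would land near 0.5, so that is the floor to compare against.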
Solution
Optimizing the reward-weighted log-probability objective (Exercise 6) on a task with a simple programmatic reward (e.g. produce sequences with a target property) shows the average reward climbing over training: the policy gradient 'doing more of what worked' in action.
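A toy version of this experiment fits in pure Python. The sketch assumes a deliberately tiny policy (one softmax over a 4-token vocabulary, shared by every position) and a hypothetical reward equal to the fraction of tokens matching a target token; none of these choices come from the chapter.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=300, batch=16, seq_len=5, vocab=4, target=2, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * vocab          # context-free policy: one softmax for all positions
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        grad = [0.0] * vocab
        batch_reward = 0.0
        for _ in range(batch):
            tokens = rng.choices(range(vocab), weights=probs, k=seq_len)
            reward = sum(t == target for t in tokens) / seq_len  # programmatic reward
            batch_reward += reward
            # Gradient of log pi(sequence) w.r.t. the logits is the sum over tokens
            # of (onehot(token) - probs); REINFORCE weights it by the reward.
            for t in tokens:
                for i in range(vocab):
                    grad[i] += ((1.0 if i == t else 0.0) - probs[i]) * reward
        # Gradient ascent on the batch-averaged estimate.
        logits = [z + lr * g / batch for z, g in zip(logits, grad)]
        history.append(batch_reward / batch)
    return logits, history

logits, history = train()
print("mean reward, first 10 steps:", round(sum(history[:10]) / 10, 3))
print("mean reward, last 10 steps: ", round(sum(history[-10:]) / 10, 3))
```

No baseline is used here, so the curve is noisy from step to step; the trend across the run is what to look at.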
Solution
Subtracting a running-mean baseline to form the advantage (Exercise 7) noticeably reduces the variance of the gradient estimates (visible as a tighter, less noisy training curve and lower measured gradient variance) without biasing the update — demonstrating the variance-reduction role of the baseline.
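The effect is easy to measure on a two-armed bandit with a fixed policy, so that only the estimator changes. In this sketch the rewards (a large common offset with a small difference), the policy probabilities, and the function name `grad_samples` are all illustrative assumptions.

```python
import random
import statistics

def grad_samples(rewards, probs, n, use_baseline, seed=0):
    """Single-sample REINFORCE gradients w.r.t. the logit of action 0 (softmax policy)."""
    rng = random.Random(seed)
    baseline, seen = 0.0, 0
    out = []
    for _ in range(n):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        r = rewards[a]
        b = baseline if use_baseline else 0.0
        # d log pi(a) / d logit_0 = 1[a == 0] - pi(0)
        out.append(((1.0 if a == 0 else 0.0) - probs[0]) * (r - b))
        # Update the running mean AFTER using it, so b never depends on the current sample.
        seen += 1
        baseline += (r - baseline) / seen
    return out

rewards = [10.0, 11.0]   # hypothetical rewards: large offset, small difference
probs = [0.5, 0.5]       # fixed policy
plain = grad_samples(rewards, probs, 5000, use_baseline=False)
centered = grad_samples(rewards, probs, 5000, use_baseline=True)
print("variance without baseline:", round(statistics.pvariance(plain), 3))
print("variance with baseline:   ", round(statistics.pvariance(centered), 3))
print("means:", round(statistics.fmean(plain), 3), round(statistics.fmean(centered), 3))
```

Both estimators target the same expected gradient, but without the baseline each sample is dominated by the shared reward offset, which carries no information about which arm is better.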
Solution
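Feeding synthetic probability ratios and advantages through the clipped objective and inspecting the effective gradient confirms the trust-region behavior of Exercise 8: for A > 0 the gradient vanishes once ρ exceeds 1+ε, and for A < 0 it vanishes once ρ falls below 1−ε. In every other case, including moves that make the objective worse, the gradient passes through unclipped.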
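One way to run this check without an autodiff library is a finite difference on the ratio. The sketch below assumes ε = 0.2 and a handful of made-up ratios and advantages; the function names are illustrative.

```python
def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO surrogate for one sample: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def grad_wrt_ratio(ratio: float, advantage: float, eps: float = 0.2, h: float = 1e-6) -> float:
    """Central finite difference of the objective with respect to the ratio."""
    return (clipped_objective(ratio + h, advantage, eps)
            - clipped_objective(ratio - h, advantage, eps)) / (2 * h)

# Synthetic ratios and advantages (illustrative values only).
for adv in (1.0, -1.0):
    for rho in (0.5, 0.9, 1.1, 1.5):
        print(f"A={adv:+.0f} rho={rho:.1f} "
              f"objective={clipped_objective(rho, adv):+.2f} "
              f"grad={grad_wrt_ratio(rho, adv):+.2f}")
```

The gradient is zero only in two corners: ρ = 1.5 with A = +1, and ρ = 0.5 with A = −1. Everywhere else it equals A.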
Solution
Adding −β·log(π/π_ref) to each token's reward and sweeping β shows higher β keeps the policy close to the reference (low KL) while lower β lets it drift further — the controllable leash of Exercise 9, measured by KL from the reference.
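The reward shaping can be sketched as a small helper. It assumes the per-token log-probabilities of one sampled response under the policy and the reference are already available, and that the reward model's score is credited to the final token (as in Exercise 5); the function name and numbers are hypothetical.

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * (log pi - log pi_ref), plus the RM score on the last token."""
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need equal-length, non-empty log-prob lists")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score   # the sequence-level reward arrives at the end
    return rewards

# Hypothetical per-token log-probs of one sampled response under each model.
logp_policy = [-0.2, -0.1, -0.3]
logp_ref = [-0.5, -0.4, -0.3]
for beta in (0.0, 0.1, 1.0):
    r = shaped_rewards(logp_policy, logp_ref, rm_score=1.0, beta=beta)
    print(f"beta={beta}: per-token rewards={[round(x, 3) for x in r]} total={sum(r):.3f}")
```

Tokens the policy makes more likely than the reference does are taxed in proportion to β, so the same RM score is worth less the further the policy has drifted.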
Solution
A minimal loop (generate, score with a simple reward like 'ends politely', compute advantages, clipped update with KL penalty) shows reward rising while KL stays bounded, and sample outputs shifting toward the rewarded behavior — the full RLHF loop in miniature, including the reward/KL trade-off to monitor.
Solution
Subtracting the per-prompt group mean (and dividing by the group std) yields advantages where above-average responses are positive and below-average negative (Exercise 11). Verifying these signs confirms the group provides the baseline that replaces PPO's critic.
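A minimal sketch of the group normalization for a single prompt follows. The use of the population standard deviation and the small `eps` guard against a zero denominator are assumptions of this example; implementations differ on both.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps) within one prompt's group."""
    if len(rewards) < 2:
        raise ValueError("need at least two responses per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical verifiable rewards (1 = correct, 0 = incorrect) for 4 samples of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
# A group where every response earns the same reward carries no learning signal.
print(group_advantages([1.0, 1.0, 1.0]))
```

Correct responses come out positive and incorrect ones negative, and the advantages within a group sum to zero. When the whole group agrees, every advantage is zero and the prompt contributes nothing to the update.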
Solution
GRPO (sample a group, group-normalize rewards, clipped update) trains without a value model, using less memory and often proving more stable than PPO on the same task — the practical payoff of removing the critic (Exercise 11).
Solution
Generating N responses, keeping the highest-reward one per prompt, and SFT-ing on those 'best-of-N' responses raises the model's average reward — a simple, stable alternative to RL that captures much of the benefit by distilling the reward model's preferences into the policy via supervised learning.
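The selection step that builds the SFT dataset can be sketched as below. The sampled responses are hard-coded and `toy_reward` is a hypothetical stand-in for a reward model; in a real run the responses would come from the policy and the scores from the trained RM.

```python
def best_of_n(samples_by_prompt, reward_fn):
    """For each prompt, keep the highest-reward response as an SFT target."""
    dataset = []
    for prompt, responses in samples_by_prompt.items():
        if not responses:
            continue   # nothing sampled for this prompt
        best = max(responses, key=lambda resp: reward_fn(prompt, resp))
        dataset.append({"prompt": prompt, "response": best})
    return dataset

# Toy stand-in for a reward model (hypothetical): reward polite endings.
def toy_reward(prompt, response):
    return 1.0 if response.rstrip().endswith("Thank you.") else 0.0

samples = {
    "Ask for a refund.": ["Give me my money.", "I would like a refund. Thank you."],
    "Decline a meeting.": ["No.", "I can't attend. Thank you.", "Busy."],
}
for row in best_of_n(samples, toy_reward):
    print(row)
```

The resulting prompt/response pairs are then used as ordinary supervised fine-tuning data; no policy gradient or critic is involved.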
Solution
Building the full pipeline shows the RM trained on preferences, both PPO and GRPO improving true quality with a sensible KL penalty, and — with β set too low — the policy hacking the reward (RM score up, true quality down, Exercises 9–10). Comparing PPO and GRPO on stability, memory, and held-out quality typically finds GRPO easier to get working (no critic to tune) — the integrated lesson of the chapter.