Solutions Appendix
Chapter 23

RLHF

22 Solutions

Detailed solutions for the exercises in Chapter 23. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Two limits of SFT; why preferences carry a richer signal; an easy-to-compare-hard-to-write task.

Solution

SFT's limits: (1) it can only imitate the demonstrations it's given (capped by demonstrator quality), and (2) it has no signal about what NOT to do — only positive examples. Preferences are richer because comparing two responses conveys relative quality (including what's worse), which is information demonstrations lack. Example: judging which of two poems is better, or which summary is more faithful, is far easier than writing the best one from scratch — evaluation is easier than generation.

Exercise 2 (Pen & Paper)
Name the four models in PPO-RLHF and their roles; which are frozen?

Solution

(1) Policy — the model being trained to generate preferred responses (TRAINED). (2) Reference model — a frozen copy of the initial policy, used for the KL penalty (FROZEN). (3) Reward model — scores responses, trained beforehand on preferences (FROZEN during PPO). (4) Value/critic model — estimates expected return for the advantage baseline (TRAINED). So policy and critic are trained; reference and reward are frozen.

Exercise 3 (Pen & Paper)
Why pairwise comparison over absolute rating? Two annotator-behaviour reasons.

Solution

(1) Consistency: humans give noisy, drifting absolute scores (one annotator's '7' ≠ another's '7'), but are far more reliable at saying 'A is better than B'. (2) Calibration-free: comparisons need no shared numeric scale, avoiding the need to anchor and normalize ratings across annotators and time. Relative judgments are easier and more consistent than absolute ones, yielding cleaner training signal.

Exercise 4 (Derive)
From Bradley-Terry P(A>B)=σ(r_A−r_B), derive the RM loss −logσ(r_w−r_l).

Solution

The Bradley-Terry model says the probability that the winner w is preferred over the loser l is σ(r_w − r_l). Maximizing the likelihood of the observed preferences means maximizing Σ logσ(r_w−r_l) over all labelled pairs, which is the same as minimizing the negative log-likelihood, whose per-pair term is −logσ(r_w−r_l). Minimizing it pushes r_w above r_l by a comfortable margin (until the sigmoid saturates), training the reward model to score preferred responses higher.
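The steps can be written out explicitly. This is a sketch under assumed notation: the dataset symbol \mathcal{D}, the prompt x, the responses y_w and y_l, and the parameter symbol \phi are introduced here for illustration and do not appear in the exercise statement.

```latex
\begin{align*}
P(y_w \succ y_l \mid x)
  &= \frac{e^{r_w}}{e^{r_w} + e^{r_l}}
   = \frac{1}{1 + e^{-(r_w - r_l)}}
   = \sigma(r_w - r_l), \\
\log \text{likelihood}(\phi)
  &= \sum_{(x,\, y_w,\, y_l) \in \mathcal{D}} \log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr), \\
\mathcal{L}_{\mathrm{RM}}(\phi)
  &= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
     \bigl[\log \sigma(r_w - r_l)\bigr], \\
\frac{\partial}{\partial r_w}\bigl[-\log \sigma(r_w - r_l)\bigr]
  &= -\bigl(1 - \sigma(r_w - r_l)\bigr).
\end{align*}
```

The last line shows why the push fades as the margin grows: the gradient magnitude 1 − σ(r_w − r_l) tends to zero once the sigmoid saturates.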

Exercise 5 (Pen & Paper)
Map RL vocabulary onto language generation; what is the 'policy' concretely?

Solution

State = the prompt plus tokens generated so far; action = the next token chosen; trajectory = the full generated response; reward = the reward model's score (usually given at the end of the sequence, with a per-token KL penalty). The POLICY is the language model itself — it maps the current context (state) to a distribution over next tokens (actions). Generating text is 'acting' in this RL formulation.

Exercise 6 (Derive)
Write the REINFORCE gradient; explain each factor; why does multiplying by reward implement 'do more of what worked'?

Solution

The REINFORCE gradient is E[ ∇θ log πθ(a|s) · R ]. The factor ∇θ log πθ(a|s) points in the direction that increases the probability of the taken action; multiplying by the reward R scales that step by how good the outcome was. So high-reward trajectories get their actions made MORE likely (large positive scaling), weakly rewarded ones are reinforced only slightly, and negatively rewarded ones are made less likely: 'do more of what worked'. In effect, each sampled action's log-likelihood gradient is weighted by the outcome it led to.
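For completeness, the gradient above follows from the log-derivative trick. The trajectory notation here (\tau, p_\theta(\tau), J(\theta), horizon T) is an assumption added for the sketch, not notation from the exercise.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\bigl[R(\tau)\bigr]
           = \sum_{\tau} p_\theta(\tau)\, R(\tau), \\
\nabla_\theta J(\theta)
  &= \sum_{\tau} \nabla_\theta p_\theta(\tau)\, R(\tau)
   = \sum_{\tau} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \\
  &= \mathbb{E}_{\tau \sim p_\theta}\Bigl[ R(\tau) \sum_{t=1}^{T}
     \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Bigr].
\end{align*}
```

The last step uses the fact that the environment's transition probabilities do not depend on θ, so only the policy terms survive in ∇θ log p_θ(τ).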

Exercise 7 (Pen & Paper)
Why does plain REINFORCE have high variance? How does a baseline reduce it? A>0 vs A<0?

Solution

Rewards vary wildly across trajectories, so the gradient estimate is noisy (high variance) — even good actions in a low-reward trajectory get pushed down. Subtracting a baseline b (e.g. the average/expected reward) gives the advantage A = R − b, which has lower variance without changing the expected gradient (the baseline is action-independent). A>0 means the trajectory was better than expected (increase its actions' probability); A<0 means worse than expected (decrease it). The baseline centers the signal.

Exercise 8 (Pen & Paper)
Explain the PPO clipped objective; what does the clip prevent; why take the min?

Solution

PPO maximizes min(ρ·A, clip(ρ, 1−ε, 1+ε)·A), where ρ = π_new(a|s)/π_old(a|s) is the probability ratio and A is the advantage. The clip removes any incentive to move ρ outside [1−ε, 1+ε] in a single update; larger moves would be unstable because the advantages were estimated from samples drawn under π_old and are only trustworthy near it. Taking the MIN makes the objective pessimistic: it caps the benefit of a large favorable move but does NOT cap the penalty of a large unfavorable one, so the update is conservative in the right direction and improvement stays stable without runaway steps.

Exercise 9 (Pen & Paper)
Explain reward hacking with an example; why does the KL penalty defend; effect of β too low/high?

Solution

Reward hacking is when the policy maximizes the reward model's score without genuinely improving — e.g. learning that the RM rewards long, confident, or flattering answers, so it pads responses regardless of quality. The KL penalty keeps the policy close to the reference model, preventing it from drifting into degenerate, RM-exploiting regions far from sensible language. If β is too low, the policy drifts and hacks the reward; if too high, it barely changes from the reference and learns little. β balances improvement against staying grounded.

Exercise 10 (Pen & Paper)
Describe the over-optimization (Goodhart) curve; why stop while RM score is still rising?

Solution

As you optimize against the reward model, TRUE quality first rises then falls even as the RM SCORE keeps climbing — the policy increasingly exploits the RM's imperfections (Goodhart: the proxy diverges from the goal). The curve of true quality vs RM score is hump-shaped. You must stop near the peak of true quality, which occurs while the RM score is still increasing — chasing the RM score past that point degrades the actual model.

Exercise 11 (Pen & Paper)
Compare PPO and GRPO; what does GRPO remove; where's its baseline; why suited to reasoning?

Solution

GRPO removes the separate VALUE/critic model. Instead of a learned baseline, it samples a GROUP of responses per prompt and uses the group's mean reward as the baseline, computing each response's advantage relative to its peers (group-normalized). This is simpler and more memory-efficient (no critic), and it suits reasoning because verifiable rewards (correct/incorrect) over a group give a clean relative signal — above-average solutions get positive advantage — without needing to train a value function on sparse end-rewards.

Exercise 12 (Code)
Implement a reward model (Transformer body + scalar head); train with Bradley-Terry loss.

Solution

Attach a scalar head to a Transformer's final representation and train with −logσ(r_w−r_l) (Exercise 4) on preference pairs. The model learns to output a scalar score where chosen responses score higher than rejected — the reward signal for RLHF.
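A minimal sketch of the training step, using only the standard library. It assumes the Transformer body is replaced by fixed feature vectors standing in for the final representation, and that preferences come from a hypothetical hidden scorer `TRUE_W`; all function names are illustrative. Only the scalar head and the Bradley-Terry loss are shown.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bt_loss(r_w, r_l):
    """Bradley-Terry loss -log sigmoid(r_w - r_l), in a stable form."""
    d = r_w - r_l
    return math.log1p(math.exp(-d)) if d >= 0 else -d + math.log1p(math.exp(d))

def score(w, x):
    """Scalar head: a linear map from a representation to a reward."""
    return sum(wi * xi for wi, xi in zip(w, x))

def make_pairs(n, rng, true_w):
    """Synthetic (chosen, rejected) pairs labelled by a hidden scorer."""
    pairs = []
    for _ in range(n):
        a = [rng.gauss(0, 1) for _ in true_w]
        b = [rng.gauss(0, 1) for _ in true_w]
        pairs.append((a, b) if score(true_w, a) > score(true_w, b) else (b, a))
    return pairs

def mean_loss(w, pairs):
    return sum(bt_loss(score(w, c), score(w, r)) for c, r in pairs) / len(pairs)

def train(pairs, dim, lr=0.1, epochs=30):
    """Plain SGD on the Bradley-Terry loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            d = score(w, chosen) - score(w, rejected)
            dloss_dd = -(1.0 - sigmoid(d))  # derivative of -log sigmoid(d)
            for i in range(dim):
                w[i] -= lr * dloss_dd * (chosen[i] - rejected[i])
    return w

rng = random.Random(0)
TRUE_W = [1.0, -2.0, 0.5, 1.5]  # hypothetical hidden preference direction
train_pairs = make_pairs(200, rng, TRUE_W)
loss_before = mean_loss([0.0] * 4, train_pairs)  # log(2) at zero weights
w = train(train_pairs, dim=4)
loss_after = mean_loss(w, train_pairs)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In a real reward model the features would be produced by the Transformer body and trained jointly with the head; the loss and its gradient with respect to the score difference stay the same.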

Exercise 13 (Code)
Verify the reward model: pairwise accuracy on held-out chosen vs rejected.

Solution

Measuring how often the trained RM scores the held-out chosen response above the rejected one gives pairwise accuracy (typically 65–75% on noisy human preferences). This validates that the RM generalizes the preference signal rather than memorizing it, which is a prerequisite for useful RLHF.
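The metric itself is short. The sketch below assumes a generic `score_fn` and uses response length as a stand-in scorer purely for illustration; the held-out pairs are made-up examples, not data from the chapter.

```python
def pairwise_accuracy(score_fn, pairs):
    """Fraction of (chosen, rejected) pairs where chosen scores strictly higher.

    Ties count as errors, since the scorer failed to rank the pair.
    """
    correct = sum(1 for chosen, rejected in pairs
                  if score_fn(chosen) > score_fn(rejected))
    return correct / len(pairs)

# Hypothetical held-out preference pairs: (chosen, rejected).
held_out = [
    ("Thanks, here is the answer.", "No."),
    ("Sure, happy to help.", "Figure it out."),
    ("Ok.", "Here is a detailed and correct answer."),
]

# Stand-in for a trained reward model: longer text scores higher.
accuracy = pairwise_accuracy(len, held_out)
print(f"pairwise accuracy: {accuracy:.2f}")
```

With a trained RM, `score_fn` would be the model's scalar output for a prompt-response pair, evaluated on pairs never seen in training.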

Exercise 14 (Code)
Implement REINFORCE on a toy text task; show average reward rising.

Solution

Optimizing the log-prob-weighted-by-reward objective (Exercise 6) on a task with a simple programmatic reward (e.g. produce sequences with a target property) shows the average reward climbing over training — the policy gradient 'doing more of what worked' in action.
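One possible toy setup, sketched with the standard library only. It assumes a context-free policy (a single logit vector shared across positions) and a hypothetical programmatic reward equal to the fraction of tokens matching a target token; both choices are simplifications made for the example.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_reinforce(steps=500, seq_len=8, vocab=3, target=0, lr=0.1, seed=0):
    """REINFORCE without a baseline; returns the reward of each sampled sequence."""
    rng = random.Random(seed)
    logits = [0.0] * vocab
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        tokens = rng.choices(range(vocab), weights=probs, k=seq_len)
        # Hypothetical reward: fraction of tokens equal to the target token.
        reward = sum(t == target for t in tokens) / seq_len
        # Gradient of sum_t log pi(a_t) w.r.t. logits: sum_t (onehot(a_t) - probs).
        grad = [-seq_len * p for p in probs]
        for t in tokens:
            grad[t] += 1.0
        # Ascent step scaled by the reward: "do more of what worked".
        logits = [l + lr * reward * g for l, g in zip(logits, grad)]
        history.append(reward)
    return history

history = run_reinforce()
early = sum(history[:50]) / 50
late = sum(history[-50:]) / 50
print(f"mean reward, first 50 steps: {early:.2f}; last 50 steps: {late:.2f}")
```

Plotting a moving average of `history` gives the rising reward curve the exercise asks for.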

Exercise 15 (Code)
Add a baseline (running mean) to REINFORCE; plot gradient variance with/without.

Solution

Subtracting a running-mean baseline to form the advantage (Exercise 7) noticeably reduces the variance of the gradient estimates (visible as a tighter, less noisy training curve and lower measured gradient variance) without biasing the update — demonstrating the variance-reduction role of the baseline.
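The effect can be measured directly at a fixed policy. This sketch assumes a two-action, one-step problem with hypothetical rewards of 11 and 10, chosen so that a large shared offset dominates the raw reward; it compares single-sample gradient estimates for one logit with and without a running-mean baseline.

```python
import random
import statistics

def grad_samples(use_baseline, n=20000, seed=0):
    """Single-sample REINFORCE estimates of dJ/d(logit_0) at a fixed policy."""
    rng = random.Random(seed)
    p0 = 0.5                      # probability of action 0
    rewards = {0: 11.0, 1: 10.0}  # hypothetical rewards with a large offset
    baseline, count = 0.0, 0
    grads = []
    for _ in range(n):
        a = 0 if rng.random() < p0 else 1
        r = rewards[a]
        b = baseline if use_baseline else 0.0
        dlogp = (1.0 if a == 0 else 0.0) - p0  # d log pi(a) / d logit_0
        grads.append(dlogp * (r - b))
        # Update the running mean AFTER using it, so b never depends on
        # the current sample and the estimate stays unbiased.
        count += 1
        baseline += (r - baseline) / count
    return grads

plain = grad_samples(use_baseline=False)
centered = grad_samples(use_baseline=True)
mean_plain, var_plain = statistics.mean(plain), statistics.pvariance(plain)
mean_centered, var_centered = statistics.mean(centered), statistics.pvariance(centered)
print(f"no baseline:   mean={mean_plain:.3f} var={var_plain:.3f}")
print(f"with baseline: mean={mean_centered:.3f} var={var_centered:.3f}")
```

The exact gradient here is p0·(1 − p0)·(11 − 10) = 0.25. Both estimators should average close to it, while the variance differs by orders of magnitude.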

Exercise 16 (Code)
Implement the PPO clipped objective; verify updates capped at 1±ε on synthetic ratios/advantages.

Solution

Feeding synthetic probability ratios and advantages through the clipped objective and inspecting the gradient with respect to ρ confirms the cap: for A>0 the gradient vanishes once ρ exceeds 1+ε, and for A<0 it vanishes once ρ drops below 1−ε. Moves in the unfavorable direction stay unclipped. This is exactly the trust-region-like behavior of Exercise 8.
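A small check of this behavior, written as a sketch on scalar inputs. The function names and the finite-difference slope are illustrative choices; a framework implementation would use autograd on batched tensors instead.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

def objective_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective w.r.t. the ratio."""
    up = ppo_clip_objective(ratio + h, advantage, eps)
    down = ppo_clip_objective(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

# Synthetic ratios and advantages covering all four regimes.
for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        value = ppo_clip_objective(ratio, advantage)
        slope = objective_slope(ratio, advantage)
        print(f"A={advantage:+.0f} rho={ratio:.1f} objective={value:+.2f} slope={slope:+.2f}")
```

A zero slope means the sample contributes no gradient, so nothing pushes the ratio further in that direction.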

Exercise 17 (Code)
Implement KL-penalized per-token reward; sweep β and show policy drift responds.

Solution

Adding −β·log(π/π_ref) to each token's reward and sweeping β shows higher β keeps the policy close to the reference (low KL) while lower β lets it drift further — the controllable leash of Exercise 9, measured by KL from the reference.
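The reward shaping can be sketched as below. Instead of running a training sweep, the example relies on the known closed form for a single-step problem, where the KL-regularized optimum is proportional to π_ref·exp(r/β), and measures its KL from the reference for several β. The probabilities and rewards are hypothetical, and placing the RM score on the last token follows the convention described in Exercise 5.

```python
import math

def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token reward: -beta * log(pi/pi_ref), plus the RM score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

def optimal_policy(ref_probs, rewards, beta):
    """Closed-form maximizer of E[r] - beta * KL(pi || pi_ref) for one step."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

ref_probs = [0.5, 0.3, 0.2]     # hypothetical reference distribution
action_rewards = [0.0, 1.0, 2.0]  # hypothetical rewards per action
betas = [0.1, 1.0, 10.0]
drift = [kl(optimal_policy(ref_probs, action_rewards, b), ref_probs) for b in betas]
for b, d in zip(betas, drift):
    print(f"beta={b:>4}: KL(policy || reference) = {d:.4f}")

shaped = kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=3.0, beta=0.1)
```

In a real sweep the drift would be measured on sampled tokens during training, but the same ordering is expected: larger β, smaller KL.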

Exercise 18 (Code Lab)
Build a minimal PPO-RLHF loop on a small model + simple reward; track reward, KL, samples.

Solution

A minimal loop (generate, score with a simple reward like 'ends politely', compute advantages, clipped update with KL penalty) shows reward rising while KL stays bounded, and sample outputs shifting toward the rewarded behavior — the full RLHF loop in miniature, including the reward/KL trade-off to monitor.

Exercise 19 (Code)
Implement grpo_advantages: group-normalize rewards per prompt; verify signs.

Solution

Subtracting the per-prompt group mean (and dividing by the group std) yields advantages where above-average responses are positive and below-average negative (Exercise 11). Verifying these signs confirms the group provides the baseline that replaces PPO's critic.
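One way to write `grpo_advantages`, assuming rewards arrive as one list per prompt. The small `eps` in the denominator is an added safeguard for groups whose rewards are all equal, and the sample rewards are made up.

```python
import math

def grpo_advantages(rewards_per_prompt, eps=1e-8):
    """Group-normalize rewards: (r - group mean) / (group std + eps), per prompt."""
    advantages = []
    for rewards in rewards_per_prompt:
        mean = sum(rewards) / len(rewards)
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
        advantages.append([(r - mean) / (std + eps) for r in rewards])
    return advantages

# Hypothetical rewards: two prompts with mixed outcomes, one with no variation.
groups = [
    [1.0, 0.0, 0.0, 1.0],  # verifiable correct/incorrect rewards
    [0.2, 0.9, 0.5],       # graded rewards
    [1.0, 1.0, 1.0],       # every response equally good
]
advs = grpo_advantages(groups)
for rewards, group_advs in zip(groups, advs):
    print(rewards, "->", [round(a, 2) for a in group_advs])
```

A group with identical rewards yields all-zero advantages, so that prompt contributes no gradient for the step.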

Exercise 20 (Code Lab)
Implement a minimal GRPO loop; compare to PPO in memory and stability.

Solution

GRPO (sample a group, group-normalize rewards, clipped update) trains without a value model, using less memory and often proving more stable than PPO on the same task — the practical payoff of removing the critic (Exercise 11).
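A minimal GRPO loop can be sketched on a deliberately tiny problem. The setup is an assumption made for brevity: a single prompt, a "response" that is one of four candidate answers, a verifiable 0/1 reward, and a softmax policy over logits with hand-written gradients. It illustrates the loop structure only and does not attempt the memory or stability comparison with PPO.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def group_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_train(steps=200, group=8, n_answers=4, correct=2,
               lr=0.5, clip=0.2, inner_epochs=2, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * n_answers
    for _ in range(steps):
        old_probs = softmax(logits)
        # Sample a group of responses for the prompt and score them.
        actions = rng.choices(range(n_answers), weights=old_probs, k=group)
        rewards = [1.0 if a == correct else 0.0 for a in actions]
        advs = group_advantages(rewards)  # the group mean replaces the critic
        for _ in range(inner_epochs):
            probs = softmax(logits)
            grad = [0.0] * n_answers
            for a, adv in zip(actions, advs):
                ratio = probs[a] / old_probs[a]
                # The clipped surrogate has zero gradient in these regions.
                if (adv > 0 and ratio > 1 + clip) or (adv < 0 and ratio < 1 - clip):
                    continue
                # Gradient of ratio * adv w.r.t. logits: ratio * adv * (onehot - probs).
                for i in range(n_answers):
                    onehot = 1.0 if i == a else 0.0
                    grad[i] += ratio * adv * (onehot - probs[i]) / group
            logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)

final_probs = grpo_train()
print("final answer probabilities:", [round(p, 3) for p in final_probs])
```

No value network appears anywhere in the loop, which is where the memory saving over PPO comes from; a KL penalty toward a reference policy could be added to the reward as in Exercise 17.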

Exercise 21 (Code)
Implement rejection sampling (best-of-N) + SFT on kept responses; show improved reward without RL.

Solution

Generating N responses, keeping the highest-reward one per prompt, and SFT-ing on those 'best-of-N' responses raises the model's average reward — a simple, stable alternative to RL that captures much of the benefit by distilling the reward model's preferences into the policy via supervised learning.
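The idea can be sketched on a toy policy over three canned responses. The responses, their rewards, and the use of a single prompt are hypothetical; the "SFT" step is one gradient-ascent step on the log-likelihood of the kept response.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

RESPONSES = ["ok", "Sure, here you go.", "Thanks for asking! Here is the answer."]
REWARDS = [0.0, 0.5, 1.0]  # hypothetical reward-model scores per response

def expected_reward(logits):
    return sum(p * r for p, r in zip(softmax(logits), REWARDS))

def best_of_n_sft(rounds=200, n=4, lr=0.2, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    before = expected_reward(logits)
    for _ in range(rounds):
        probs = softmax(logits)
        # Rejection sampling: draw N candidates, keep the highest-reward one.
        samples = rng.choices(range(len(RESPONSES)), weights=probs, k=n)
        best = max(samples, key=lambda i: REWARDS[i])
        # SFT step: ascend log pi(best); gradient w.r.t. logits is onehot - probs.
        logits = [l + lr * ((1.0 if i == best else 0.0) - p)
                  for i, (l, p) in enumerate(zip(logits, probs))]
    return before, expected_reward(logits)

before, after = best_of_n_sft()
print(f"expected reward before: {before:.2f}, after: {after:.2f}")
```

No advantage estimate, clipping, or critic is involved; the reward only decides which samples enter the supervised dataset.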

Exercise 22 (Code, Challenge)
Full mini-RLHF: train RM, optimize with PPO AND GRPO + KL, force reward hacking with low β, compare PPO vs GRPO.

Solution

Building the full pipeline shows the RM trained on preferences, both PPO and GRPO improving true quality with a sensible KL penalty, and — with β set too low — the policy hacking the reward (RM score up, true quality down, Exercises 9–10). Comparing PPO and GRPO on stability, memory, and held-out quality typically finds GRPO easier to get working (no critic to tune) — the integrated lesson of the chapter.