Solutions Appendix

Chapter 24

Direct Preference Optimization

20 Solutions

Detailed solutions for the exercises in Chapter 24. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)

What does DPO eliminate vs PPO-RLHF, and what two things does it keep? Why keep the reference model?

Solution

DPO eliminates the separately-trained reward model, the value/critic model, and the entire online RL sampling loop. It keeps (1) the preference data and (2) the frozen reference model. The reference matters because DPO's implicit reward is defined RELATIVE to it (β·log(π/π_ref)); the reference anchors the policy (playing the role of the KL penalty), preventing it from drifting arbitrarily and giving the implicit reward a meaningful baseline.

Exercise 2 (Pen & Paper)

Explain 'reward is implicit in the policy'. What is the implicit reward of a response?

Solution

DPO's key insight is that the optimal RLHF policy and its reward are two views of the same thing: given the closed-form optimal policy under a KL-penalized reward, you can invert it to read the reward off the policy. The implicit reward of a response is r(x,y) = β·log(π(y|x)/π_ref(y|x)) (up to a prompt-dependent constant) — how much more likely the policy makes the response than the reference does. So a model that prefers a response IS, implicitly, assigning it higher reward.

Exercise 3 (Derive)

From π* ∝ π_ref·exp(r/β), solve for r.

Solution

Taking logs of π*(y|x) = (1/Z(x))·π_ref(y|x)·exp(r(x,y)/β): logπ* = logπ_ref + r/β − logZ(x). Solving for r:

r(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·logZ(x)

The last term βlogZ(x) depends only on the prompt x, not the response — a constant across responses to the same prompt, which is exactly why it will cancel (Exercise 4).
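The inversion can be checked numerically on a toy prompt with three candidate responses. This is a minimal sketch: the reference probabilities, rewards, and β below are made-up illustration values, and the function names are hypothetical.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Return (pi_star, Z) with pi_star[y] = pi_ref[y] * exp(r[y] / beta) / Z."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    Z = sum(unnormalized)
    return [u / Z for u in unnormalized], Z

def recover_reward(pi_star, pi_ref, beta, Z):
    """Invert the closed form: r = beta * log(pi_star / pi_ref) + beta * log Z."""
    return [beta * math.log(ps / pr) + beta * math.log(Z)
            for ps, pr in zip(pi_star, pi_ref)]

# Toy values (assumed, for illustration only): three responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -2.0]
beta = 0.5

pi_star, Z = optimal_policy(pi_ref, rewards, beta)
recovered = recover_reward(pi_star, pi_ref, beta, Z)

for r, r_hat in zip(rewards, recovered):
    print(f"original r = {r:+.4f}   recovered r = {r_hat:+.4f}")
```

Dropping the β·logZ(x) term shifts every recovered reward by the same amount, so differences between responses to the same prompt are unchanged.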

Exercise 4 (Derive)

Substitute the implicit reward into Bradley-Terry; show the constant cancels, yielding the DPO loss.

Solution

Bradley-Terry preference probability is σ(r_w − r_l). Substituting r = βlog(π/π_ref) + βlogZ(x), the prompt-dependent βlogZ(x) appears in BOTH r_w and r_l and cancels in the difference. This leaves the DPO loss:

−log σ( β·log(π(y_w|x)/π_ref(y_w|x)) − β·log(π(y_l|x)/π_ref(y_l|x)) )

a simple supervised loss on the policy, with no reward model needed — the cancellation of the partition function is what makes DPO work.
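A small numeric check of the cancellation, as a sketch: the log-probabilities are invented, and `bradley_terry_nll` and `dpo_loss` are hypothetical helper names. Whatever value logZ(x) takes, the Bradley-Terry loss on the full implicit rewards matches the DPO loss, which never sees Z.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_nll(r_w, r_l):
    """Negative log-likelihood that the chosen response wins: -log sigma(r_w - r_l)."""
    return -log_sigmoid(r_w - r_l)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one pair, written directly in terms of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# Invented sequence log-probs for one (chosen, rejected) pair.
beta = 0.1
logp_w, logp_l = -12.0, -15.0          # policy
ref_logp_w, ref_logp_l = -13.0, -14.0  # frozen reference

loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

# Add beta * log Z(x) to both implicit rewards: the loss does not move.
losses_with_z = []
for log_z in (-3.0, 0.0, 7.5):
    r_w = beta * (logp_w - ref_logp_w) + beta * log_z
    r_l = beta * (logp_l - ref_logp_l) + beta * log_z
    losses_with_z.append(bradley_terry_nll(r_w, r_l))

print(loss, losses_with_z)
```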

Exercise 5 (Pen & Paper)

Why is DPO a 'contrastive' form of SFT? What signal does it use that SFT lacks?

Solution

Plain SFT only raises the likelihood of good (chosen) responses. DPO is contrastive: it simultaneously raises the chosen response's likelihood (relative to the reference) and LOWERS the rejected response's — using the NEGATIVE signal (what not to do) that SFT lacks. It learns from pairs, pushing chosen and rejected apart, rather than only imitating positives.

Exercise 6 (Pen & Paper)

Role of β in DPO; behavior as β→0 and β large; connect to RLHF KL.

Solution

β controls how strongly the policy is tied to the reference. As β→0, the anchor vanishes: the policy can move arbitrarily far from the reference to satisfy the preferences, which makes it prone to overfitting and degeneration. As β grows large, the policy stays very close to the reference and changes little. β is the KL-penalty coefficient of the RLHF objective (maximize reward minus β·KL(π‖π_ref)), not its inverse: small β = weak leash (large allowed KL), large β = strong leash.

Exercise 7 (Pen & Paper)

Explain DPO's failure where BOTH chosen and rejected probabilities drop. Why, and what to monitor?

Solution

DPO only requires the MARGIN log(π_w/π_ref_w) − log(π_l/π_ref_l) to grow; it can achieve this by lowering the rejected probability FASTER than the chosen — so both absolute probabilities fall while the margin still increases. The model becomes less likely to produce the chosen (good) responses too, hurting quality. Monitor the ABSOLUTE log-probabilities of chosen responses (not just the margin/loss); if the chosen log-prob is dropping, you are in this failure mode.

Exercise 8 (Pen & Paper)

Compare offline (DPO) and online (PPO/GRPO); why can online exceed any fixed dataset?

Solution

DPO is offline: it learns from a FIXED preference dataset, so it can only be as good as the responses in that data. Online methods (PPO/GRPO) generate FRESH responses from the current policy and get them scored each step, so they can explore and improve beyond any pre-collected responses — discovering better outputs the dataset never contained. Online's exploration is its edge; offline's simplicity and stability are DPO's.

Exercise 9 (Pen & Paper)

How does IPO fix DPO's overfitting? Why does squared-error-to-target avoid the runaway margin?

Solution

DPO's log-sigmoid loss decreases monotonically in the margin and never reaches its minimum at any finite margin, so there is always some incentive to push chosen and rejected further apart; on near-deterministic preference data this can overfit and degenerate. IPO replaces the sigmoid log-loss with a SQUARED-ERROR to a fixed target margin: once the margin reaches the target, the loss is minimized, and overshooting the target is penalized just like undershooting it. This bounded objective prevents the runaway margin growth that causes DPO's overfitting.

Exercise 10 (Pen & Paper)

Compare DPO, IPO, KTO, ORPO on paired-data and reference-model needs; when is each best?

Solution

DPO: needs paired preferences + reference; good general default. IPO: paired + reference, with overfitting control; best when DPO overfits. KTO: needs only UNPAIRED good/bad labels (no pairs) + reference; best when you have thumbs-up/down data, not comparisons. ORPO: no separate reference model and combines SFT + preference in one stage; best when you want a single-stage, reference-free pipeline. Choose by what data you have (paired vs unpaired) and whether you want to avoid a reference model.

Exercise 11 (Code)

Implement sequence_logprob; verify against a manual computation.

Solution

Summing the per-token log-probabilities of a response under the model (masking the prompt) gives the sequence log-prob. Checking it against a hand-computed value on a tiny example confirms correctness — the building block for DPO's implicit reward (Exercise 2).
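One possible shape for this, as a framework-free sketch: `sequence_logprob` here takes plain Python lists rather than tensors, and it assumes the logits are already aligned so that `logits[t]` scores `token_ids[t]`. The tiny three-token vocabulary and the logits are chosen so the answer can be computed by hand.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw scores, using the max-shift trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_logprob(logits, token_ids, response_mask):
    """Sum log p(token_t | context) over positions where response_mask is 1.

    Assumes logits[t] are the scores used to predict token_ids[t];
    prompt positions carry mask 0 and are skipped.
    """
    total = 0.0
    for step_logits, tok, keep in zip(logits, token_ids, response_mask):
        if keep:
            total += log_softmax(step_logits)[tok]
    return total

# Position 0 is the prompt (masked out); positions 1 and 2 are the response.
logits = [
    [2.0, 0.0, 0.0],            # prompt position, ignored
    [0.0, 0.0, 0.0],            # uniform: p = 1/3 for every token
    [0.0, math.log(2.0), 0.0],  # p = [1/4, 1/2, 1/4]
]
token_ids = [0, 2, 1]
response_mask = [0, 1, 1]

value = sequence_logprob(logits, token_ids, response_mask)
manual = math.log(1.0 / 3.0) + math.log(1.0 / 2.0)  # hand computation: log(1/6)
print(value, manual)
```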

Exercise 12 (Code)

Implement the DPO loss from sequence_logprob; verify it raises chosen and lowers rejected.

Solution

Implementing −logσ(β(Δ_w − Δ_l)) where Δ = logprob_policy − logprob_ref (Exercise 4), and checking that a gradient step raises the chosen response's log-prob and lowers the rejected's, validates the loss does what it should.
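The direction of the update can be checked without a model by treating the two policy log-probs as free parameters. That is a simplifying assumption: in a real network they share weights, so a step need not move them independently. The sketch compares finite-difference gradients with the analytic ones, then takes one gradient-descent step; all numbers are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # log(1 + exp(-margin)) equals -log(sigmoid(margin)); fine for moderate margins.
    return math.log1p(math.exp(-margin))

def numeric_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, eps=1e-6):
    """Central finite differences of the loss w.r.t. the two policy log-probs."""
    g_w = (dpo_loss(logp_w + eps, logp_l, ref_logp_w, ref_logp_l, beta)
           - dpo_loss(logp_w - eps, logp_l, ref_logp_w, ref_logp_l, beta)) / (2 * eps)
    g_l = (dpo_loss(logp_w, logp_l + eps, ref_logp_w, ref_logp_l, beta)
           - dpo_loss(logp_w, logp_l - eps, ref_logp_w, ref_logp_l, beta)) / (2 * eps)
    return g_w, g_l

beta = 0.1
logp_w, logp_l = -12.0, -15.0
ref_logp_w, ref_logp_l = -13.0, -14.0

g_w, g_l = numeric_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)

# Analytic gradients: dL/dlogp_w = -beta * sigmoid(-margin), dL/dlogp_l = +beta * sigmoid(-margin).
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
analytic_w = -beta * sigmoid(-margin)
analytic_l = +beta * sigmoid(-margin)

# One gradient-descent step on the log-probs themselves.
lr = 1.0
new_logp_w = logp_w - lr * g_w   # moves up, because g_w < 0
new_logp_l = logp_l - lr * g_l   # moves down, because g_l > 0
old_loss = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
new_loss = dpo_loss(new_logp_w, new_logp_l, ref_logp_w, ref_logp_l, beta)
print(g_w, g_l, old_loss, new_loss)
```

The two gradients are equal and opposite, which is the contrastive push described in Exercise 5.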

Exercise 13 (Code Lab)

Build the full DPO loop with a frozen reference; show before/after preference shift.

Solution

Fine-tuning a small SFT model with DPO against a frozen reference shifts its outputs toward the chosen style on held-out prompts — demonstrating preference alignment without any reward model or RL loop, the core appeal of DPO.

Exercise 14 (Code)

Track absolute chosen/rejected log-probs; demonstrate the 'both drop' failure by overtraining.

Solution

Logging the absolute log-probabilities during training and overtraining reproduces Exercise 7's failure: the rejected log-prob plummets while the chosen also declines, the margin still grows, and quality degrades — showing why monitoring absolute (not just relative) log-probs matters.
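A monitoring helper along these lines could flag the failure from logged values. This is a sketch: `both_drop_steps` is a hypothetical name, and the logged numbers are invented for illustration, not measured. Because the reference log-probs are fixed, the change in margin between two evaluations of the same batch is just the change in chosen log-prob minus the change in rejected log-prob.

```python
def both_drop_steps(chosen_logps, rejected_logps):
    """Indices t where the margin grew from t-1 to t while BOTH log-probs fell.

    Inputs are mean policy log-probs of chosen / rejected responses on a
    fixed evaluation batch, one entry per logging step.
    """
    flagged = []
    for t in range(1, len(chosen_logps)):
        d_chosen = chosen_logps[t] - chosen_logps[t - 1]
        d_rejected = rejected_logps[t] - rejected_logps[t - 1]
        margin_grew = (d_chosen - d_rejected) > 0
        if margin_grew and d_chosen < 0 and d_rejected < 0:
            flagged.append(t)
    return flagged

# Invented log for illustration: healthy at first, then the failure sets in.
chosen_log = [-20.0, -19.5, -21.0, -24.0]
rejected_log = [-22.0, -23.0, -27.0, -35.0]

flagged = both_drop_steps(chosen_log, rejected_log)
print(flagged)
```

At step 1 the chosen log-prob still rises, so nothing is flagged; at steps 2 and 3 the margin keeps growing only because the rejected log-prob falls faster than the chosen one.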

Exercise 15 (Code)

Sweep β; plot policy drift (KL from reference) vs β.

Solution

Running DPO at several β values and measuring KL(π‖π_ref) shows smaller β → larger drift (more change from the reference) and larger β → smaller drift — the empirical version of Exercise 6's leash analogy.
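The expected trend can be previewed without training. The sketch below uses the closed-form optimum π* ∝ π_ref·exp(r/β) from Exercise 3 on a toy three-response prompt instead of an actual DPO run, so it shows the idealized relationship only. The reference distribution and rewards are made-up values.

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """Closed-form KL-regularized optimum: pi_ref * exp(r / beta), normalized."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy values (assumed): three responses to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -2.0]

betas = [0.05, 0.1, 0.5, 1.0, 5.0]
drift = [kl_divergence(tilted_policy(pi_ref, rewards, b), pi_ref) for b in betas]

for b, d in zip(betas, drift):
    print(f"beta = {b:<5} KL(pi || pi_ref) = {d:.4f}")
```

In a real sweep the same plot would be built from KL estimates measured on sampled responses after each training run.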

Exercise 16 (Code)

Implement the IPO loss; compare overfitting to DPO on the same data.

Solution

Swapping in IPO's squared-error-to-target loss (Exercise 9) and training on the same data shows IPO resisting the margin runaway and overfitting that DPO exhibits when overtrained — a direct demonstration of the bounded-objective fix.
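The difference shows up directly in the gradient with respect to the log-ratio margin h = Δ_w − Δ_l. The sketch below assumes the commonly cited IPO form (h − 1/(2τ))², so the target 1/(2τ) and the parameter name `tau` are assumptions to check against the IPO paper; the β and τ values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad_wrt_margin(h, beta):
    """d/dh of -log(sigmoid(beta * h)): negative for every finite h."""
    return -beta * sigmoid(-beta * h)

def ipo_loss(h, tau):
    """Squared error between the margin and the fixed target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2

def ipo_grad_wrt_margin(h, tau):
    """d/dh of the IPO loss: zero at the target, positive beyond it."""
    return 2.0 * (h - 1.0 / (2.0 * tau))

beta, tau = 0.1, 0.1          # arbitrary illustration values
target = 1.0 / (2.0 * tau)    # = 5.0

for h in (0.0, target, 2 * target, 10 * target):
    print(f"h = {h:5.1f}   DPO grad = {dpo_grad_wrt_margin(h, beta):+.4f}"
          f"   IPO grad = {ipo_grad_wrt_margin(h, tau):+.4f}")
```

Gradient descent on DPO always increases h because its gradient is negative everywhere; IPO stops at the target and pulls h back if it overshoots.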

Exercise 17 (Code)

Implement KTO on UNPAIRED good/bad labels; align without preference pairs.

Solution

KTO uses individually-labeled good/bad responses (no pairs) with a prospect-theory-inspired loss, aligning the model from thumbs-up/down style data. Showing it improves alignment without constructed pairs demonstrates the unpaired-data setting of Exercise 10.
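A per-example loss in the spirit of KTO might look like the sketch below. It follows the commonly described form, in which each response's log-ratio is compared with a reference point. Here that reference point (`z_ref`, standing in for a batch-level KL estimate) is simply passed in as a number, and the weights `lambda_d` and `lambda_u` and all values are illustrative assumptions rather than a faithful reproduction of any library's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_example_loss(logp_policy, logp_ref, desirable, z_ref,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for ONE labeled response; no paired counterpart is needed.

    desirable=True  -> loss falls as the policy raises the response above z_ref.
    desirable=False -> loss falls as the policy pushes the response below z_ref.
    """
    log_ratio = logp_policy - logp_ref
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (z_ref - log_ratio)))

# Invented unpaired batch: (policy log-prob, reference log-prob, thumbs-up?)
batch = [
    (-10.0, -12.0, True),
    (-30.0, -25.0, False),
    (-18.0, -18.0, True),
]
z_ref = 0.0  # assumed reference point for this illustration

losses = [kto_example_loss(lp, lr, good, z_ref) for lp, lr, good in batch]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```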

Exercise 18 (Code)

Demonstrate the length-bias trap: longer chosen responses → verbose model; length-balance fixes it.

Solution

If chosen responses are systematically longer, DPO learns that 'longer = better' and the model becomes verbose. Length-balancing the preference data (matching chosen/rejected lengths) shrinks the effect — a concrete demonstration of how dataset artifacts leak into the aligned model and how to mitigate them.
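One simple way to length-balance is to drop pairs whose lengths differ by more than a fixed ratio. This is a sketch of that one option: the whitespace tokenization, the `max_ratio` threshold, and the toy pairs are all assumptions chosen for illustration.

```python
def n_tokens(text):
    """Crude length measure: whitespace-separated tokens."""
    return len(text.split())

def mean_length_gap(pairs):
    """Average of len(chosen) - len(rejected); positive means chosen runs longer."""
    gaps = [n_tokens(p["chosen"]) - n_tokens(p["rejected"]) for p in pairs]
    return sum(gaps) / len(gaps)

def length_balanced(pairs, max_ratio=1.5):
    """Keep only pairs whose longer side is at most max_ratio times the shorter."""
    kept = []
    for p in pairs:
        a, b = n_tokens(p["chosen"]), n_tokens(p["rejected"])
        if max(a, b) <= max_ratio * max(1, min(a, b)):
            kept.append(p)
    return kept

# Toy preference pairs (invented).
pairs = [
    {"chosen": "Paris is the capital of France and has been for centuries",
     "rejected": "Paris"},
    {"chosen": "Use a hash map", "rejected": "Try a nested loop"},
    {"chosen": "It returns None", "rejected": "It raises an error"},
]

balanced = length_balanced(pairs)
gap_before = mean_length_gap(pairs)
gap_after = mean_length_gap(balanced)
print(gap_before, gap_after, len(balanced))
```

Filtering discards data, so on a real dataset it is worth checking how many pairs survive before training on the balanced set.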

Exercise 19 (Code Lab)

Combine DPO with LoRA: policy = base+LoRA, reference = frozen base; confirm adapter-only training and memory savings.

Solution

Using the frozen base as BOTH the reference and the LoRA backbone means only the adapter trains while the base serves double duty — a memory-efficient DPO that needs no separate reference copy. Confirming only the adapter receives gradients and reporting the memory savings shows DPO+LoRA is a practical, lightweight alignment recipe.
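The savings can be estimated with back-of-the-envelope arithmetic before running anything. The sketch below counts parameters only (no optimizer state, activations, or quantization), uses the standard LoRA shape W + B·A with rank r, and picks a single 4096×4096 weight matrix and rank 8 purely as example numbers; the function names are hypothetical.

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA adds B (d_out x rank) and A (rank x d_in) next to a frozen d_out x d_in weight."""
    return rank * (d_out + d_in)

def resident_weight_params(base_params, adapter_params, use_lora):
    """Weights held in memory for DPO.

    Full fine-tuning: a trainable policy copy plus a frozen reference copy.
    LoRA: one frozen base (doubling as the reference) plus the adapter.
    """
    if use_lora:
        return base_params + adapter_params
    return 2 * base_params

# Example numbers (assumed): one square projection matrix, rank-8 adapter.
d_out, d_in, rank = 4096, 4096, 8
full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)

print("trainable fraction:", adapter / full)
print("weights, full DPO :", resident_weight_params(full, adapter, use_lora=False))
print("weights, LoRA DPO :", resident_weight_params(full, adapter, use_lora=True))
```

In the lab itself, the adapter-only claim is confirmed by checking which parameters have gradients after a backward pass, and the memory figure by measuring peak usage rather than relying on this estimate.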

Exercise 20 (Code, Challenge)

DPO vs GRPO head-to-head on the same data/reward; compare quality, stability, time, memory; recommend.

Solution

Aligning the same SFT model with offline DPO and online GRPO, then comparing the two with a held-out judge, shows the offline-vs-online trade-off (Exercise 8): DPO is simpler, faster, and more stable but capped by the dataset; GRPO can exceed it via exploration, at higher cost and complexity. A sensible recommendation: DPO for general alignment where good preference data exists, online RL (GRPO) for reasoning or when you can verify/score fresh samples.
