Direct Preference Optimization
Detailed solutions for the exercises in Chapter 24. Try solving them yourself before checking the answers.
Solution
DPO eliminates the separately-trained reward model, the value/critic model, and the entire online RL sampling loop. It keeps (1) the preference data and (2) the frozen reference model. The reference matters because DPO's implicit reward is defined RELATIVE to it (β·log(π/π_ref)); the reference anchors the policy (playing the role of the KL penalty), preventing it from drifting arbitrarily and giving the implicit reward a meaningful baseline.
Solution
DPO's key insight is that the optimal RLHF policy and its reward are two views of the same thing: given the closed-form optimal policy under a KL-penalized reward, you can invert it to read the reward off the policy. The implicit reward of a response is r(x,y) = β·log(π(y|x)/π_ref(y|x)) (up to a prompt-dependent constant) — how much more likely the policy makes the response than the reference does. So a model that prefers a response IS, implicitly, assigning it higher reward.
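As a minimal sketch of the implicit reward, the snippet below computes β·log(π/π_ref) from sequence log-probabilities. The function name `implicit_reward` and the numeric log-probs are hypothetical, chosen only for illustration.

```python
def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """beta * log(pi(y|x) / pi_ref(y|x)), computed from sequence log-probs."""
    return beta * (logp_policy - logp_ref)

# Hypothetical sequence log-probs for two responses to the same prompt.
r_a = implicit_reward(logp_policy=-12.0, logp_ref=-15.0, beta=0.1)  # policy upweights A
r_b = implicit_reward(logp_policy=-20.0, logp_ref=-18.0, beta=0.1)  # policy downweights B

print(r_a > r_b)  # the response the policy favors more than the reference gets higher reward
```

The prompt-dependent constant is omitted here because it is the same for both responses and does not affect their ranking.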
Solution
Taking logs of π*(y|x) = (1/Z(x))·π_ref(y|x)·exp(r(x,y)/β) gives log π*(y|x) = log π_ref(y|x) + r(x,y)/β − log Z(x). Solving for r:
r(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)
The last term β·log Z(x) depends only on the prompt x, not the response. It is a constant across responses to the same prompt, which is exactly why it will cancel (Exercise 4).
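The inversion can be checked numerically on a toy problem. The sketch below assumes a hypothetical three-response space with made-up reference probabilities and rewards; it builds the closed-form optimal policy and then recovers the reward from it.

```python
import math

beta = 0.5
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}    # hypothetical reference distribution
reward = {"a": 1.0, "b": 0.0, "c": -1.0}   # hypothetical true rewards

# Closed-form optimum: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
unnorm = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
Z = sum(unnorm.values())
pi_star = {y: u / Z for y, u in unnorm.items()}

# Invert: r(y) = beta * log(pi*(y) / pi_ref(y)) + beta * log Z
recovered = {
    y: beta * math.log(pi_star[y] / pi_ref[y]) + beta * math.log(Z)
    for y in pi_ref
}

for y in pi_ref:
    print(y, reward[y], round(recovered[y], 6))
```

The recovered values match the original rewards up to floating-point error, and dropping the β·log Z term shifts all three by the same amount.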
Solution
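The Bradley-Terry preference probability is P(y_w ≻ y_l | x) = σ(r_w − r_l). Substituting r = β·log(π/π_ref) + β·log Z(x), the prompt-dependent β·log Z(x) appears in BOTH r_w and r_l and cancels in the difference. Taking the negative log-likelihood of the preferences leaves the DPO loss:
L_DPO = −E_(x, y_w, y_l) [ log σ( β·log(π(y_w|x)/π_ref(y_w|x)) − β·log(π(y_l|x)/π_ref(y_l|x)) ) ]
which is a simple supervised loss on the policy, with no reward model needed. The cancellation of the partition function is what makes DPO work.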
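The cancellation can be seen in a few lines of code. In this sketch the helper names (`log_sigmoid`, `bt_loss`, `dpo_loss`) and all numeric values are illustrative assumptions; the constant `c` stands in for β·log Z(x).

```python
import math

def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def bt_loss(r_w: float, r_l: float) -> float:
    # Negative log-likelihood of "w preferred over l" under Bradley-Terry.
    return -log_sigmoid(r_w - r_l)

def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta):
    # Arguments are sequence log-probs under the policy and the reference.
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)

beta = 0.1
pol_w, ref_w, pol_l, ref_l = -12.0, -13.0, -15.0, -14.0  # hypothetical log-probs
c = 3.7  # stand-in for beta * log Z(x); any value gives the same loss

r_w = beta * (pol_w - ref_w) + c
r_l = beta * (pol_l - ref_l) + c
print(bt_loss(r_w, r_l), dpo_loss(pol_w, ref_w, pol_l, ref_l, beta))
```

Both printed values agree up to floating-point error, whatever value is chosen for `c`.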
Solution
Plain SFT only raises the likelihood of good (chosen) responses. DPO is contrastive: it simultaneously raises the chosen response's likelihood (relative to the reference) and LOWERS the rejected response's — using the NEGATIVE signal (what not to do) that SFT lacks. It learns from pairs, pushing chosen and rejected apart, rather than only imitating positives.
Solution
β controls how strongly the policy is tied to the reference. As β→0, the policy can move arbitrarily far to satisfy preferences (no anchoring, so it is prone to overfitting/degeneration). As β grows large, the policy stays very close to the reference (minimal change). β plays exactly the role of the KL-penalty weight in the RLHF objective (reward minus β times the KL to the reference): small β = weak leash (large allowed KL), large β = strong leash.
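The leash effect can be illustrated with the closed-form optimal policy π* ∝ π_ref·exp(r/β) on a toy problem. This is a sketch under stated assumptions: the three-response distribution and rewards are made up, and it evaluates the analytic optimum rather than running any training.

```python
import math

def optimal_policy(pi_ref, reward, beta):
    # pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def kl(p, q):
    # KL(p || q) for discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference distribution
reward = [0.0, 1.0, 2.0]   # hypothetical rewards
betas = [0.1, 1.0, 10.0]

kls = [kl(optimal_policy(pi_ref, reward, b), pi_ref) for b in betas]
for b, k in zip(betas, kls):
    print(f"beta={b}: KL(pi* || pi_ref) = {k:.4f}")
```

The KL shrinks as β grows: with small β the optimum piles nearly all mass on the highest-reward response, and with large β it barely moves from the reference.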
Solution
DPO only requires the MARGIN log(π_w/π_ref_w) − log(π_l/π_ref_l) to grow; it can achieve this by lowering the rejected probability FASTER than the chosen — so both absolute probabilities fall while the margin still increases. The model becomes less likely to produce the chosen (good) responses too, hurting quality. Monitor the ABSOLUTE log-probabilities of chosen responses (not just the margin/loss); if the chosen log-prob is dropping, you are in this failure mode.
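A monitoring check for this failure mode might look like the sketch below. The function name `diagnose`, the history format, and the logged values are all hypothetical; it compares the first and last evaluation points only. Because the reference is frozen, a change in the policy's chosen-minus-rejected gap equals the change in the DPO margin.

```python
def diagnose(history):
    """history: list of (mean chosen log-prob, mean rejected log-prob) under the policy."""
    first_c, first_r = history[0]
    last_c, last_r = history[-1]
    margin_grew = (last_c - last_r) > (first_c - first_r)
    chosen_fell = last_c < first_c
    return {
        "margin_grew": margin_grew,
        "chosen_fell": chosen_fell,
        # Warn when the loss looks healthy but chosen responses are getting less likely.
        "warning": margin_grew and chosen_fell,
    }

# Hypothetical logged values: both log-probs fall, rejected falls faster.
history = [(-10.0, -11.0), (-11.0, -15.0), (-12.5, -22.0)]
report = diagnose(history)
print(report)
```

A real training loop would more likely track these values at every evaluation step and alert on a sustained downward trend in the chosen log-prob.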
Solution
DPO is offline: it learns from a FIXED preference dataset, so it can only be as good as the responses in that data. Online methods (PPO/GRPO) generate FRESH responses from the current policy and get them scored each step, so they can explore and improve beyond any pre-collected responses, discovering better outputs the dataset never contained. Exploration is the edge of online methods; simplicity and stability are the edge of offline DPO.
Solution
DPO's log-sigmoid loss keeps rewarding ever-larger margins: its gradient shrinks as the margin grows but never reaches zero, so there is always some incentive to push chosen and rejected further apart, and the model can overfit and degenerate. IPO replaces the sigmoid log-loss with a SQUARED-ERROR to a fixed target margin: once the margin reaches the target, the loss is minimized and there is no incentive to push further. This bounded objective prevents the runaway margin growth that causes DPO's overfitting.
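The difference shows up in the gradient with respect to the margin h (the chosen log-ratio minus the rejected log-ratio). The sketch below assumes the common parameterisation in which IPO's target margin is 1/(2β); the function names are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad(h: float, beta: float) -> float:
    # d/dh of -log(sigmoid(beta * h)); always negative, so descent keeps raising h.
    return -beta * sigmoid(-beta * h)

def ipo_loss(h: float, beta: float) -> float:
    # Squared error to a fixed target margin (assumed here to be 1 / (2 * beta)).
    return (h - 1.0 / (2.0 * beta)) ** 2

def ipo_grad(h: float, beta: float) -> float:
    return 2.0 * (h - 1.0 / (2.0 * beta))

beta = 0.1
target = 1.0 / (2.0 * beta)
for h in (0.0, target, target + 3.0):
    print(f"h={h}: dpo_grad={dpo_grad(h, beta):+.4f}, ipo_grad={ipo_grad(h, beta):+.4f}")
```

At the target IPO's gradient is zero, and past it the gradient turns positive and pulls the margin back, while DPO's gradient stays negative everywhere.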
Solution
DPO: needs paired preferences + reference; good general default. IPO: paired + reference, with overfitting control; best when DPO overfits. KTO: needs only UNPAIRED good/bad labels (no pairs) + reference; best when you have thumbs-up/down data, not comparisons. ORPO: no separate reference model and combines SFT + preference in one stage; best when you want a single-stage, reference-free pipeline. Choose by what data you have (paired vs unpaired) and whether you want to avoid a reference model.
Solution
Summing the per-token log-probabilities of a response under the model (masking the prompt) gives the sequence log-prob. Checking it against a hand-computed value on a tiny example confirms correctness — the building block for DPO's implicit reward (Exercise 2).
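A dependency-free sketch of the computation follows. It assumes the logits are already aligned so that `step_logits[t]` is the distribution that predicts `token_ids[t]`; the function names and the tiny two-token vocabulary are illustrative.

```python
import math

def log_softmax(logits):
    # Stable log-softmax over one vocabulary-sized list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sequence_logprob(step_logits, token_ids, response_mask):
    """Sum log-probs of the realized tokens, counting response positions only.

    response_mask[t] is 1 for response tokens and 0 for prompt tokens.
    """
    total = 0.0
    for logits, tok, keep in zip(step_logits, token_ids, response_mask):
        if keep:
            total += log_softmax(logits)[tok]
    return total

# Tiny hand-checkable case: vocabulary of 2, uniform logits at every step,
# one prompt token (masked out) followed by two response tokens.
step_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
token_ids = [1, 0, 1]
response_mask = [0, 1, 1]

value = sequence_logprob(step_logits, token_ids, response_mask)
print(value, 2 * math.log(0.5))
```

Each response token has probability 0.5 under uniform logits, so the hand-computed answer is 2·log(0.5), which the function reproduces.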
Solution
Implementing −logσ(β(Δ_w − Δ_l)) where Δ = logprob_policy − logprob_ref (Exercise 4), and checking that a gradient step raises the chosen response's log-prob and lowers the rejected's, validates the loss does what it should.
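Before involving a real model, the direction of the update can be sanity-checked with finite differences on the scalar loss. This sketch treats the four sequence log-probs as plain numbers (hypothetical values) and checks the sign of the loss gradient with respect to the policy's chosen and rejected log-probs.

```python
import math

def dpo_loss(pol_w, ref_w, pol_l, ref_l, beta):
    h = (pol_w - ref_w) - (pol_l - ref_l)
    # log(1 + exp(-beta * h)) equals -log(sigmoid(beta * h)).
    return math.log1p(math.exp(-beta * h))

beta, eps = 0.1, 1e-5
pol_w, ref_w, pol_l, ref_l = -12.0, -12.5, -14.0, -13.5  # hypothetical log-probs

# Central finite differences with respect to the policy log-probs.
g_w = (dpo_loss(pol_w + eps, ref_w, pol_l, ref_l, beta)
       - dpo_loss(pol_w - eps, ref_w, pol_l, ref_l, beta)) / (2 * eps)
g_l = (dpo_loss(pol_w, ref_w, pol_l + eps, ref_l, beta)
       - dpo_loss(pol_w, ref_w, pol_l - eps, ref_l, beta)) / (2 * eps)

print(g_w, g_l)  # negative for chosen, positive for rejected
```

A negative gradient on the chosen log-prob means a gradient-descent step raises it, and a positive gradient on the rejected log-prob means the step lowers it. The two gradients are equal in magnitude because the loss depends only on their difference.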
Solution
Fine-tuning a small SFT model with DPO against a frozen reference shifts its outputs toward the chosen style on held-out prompts — demonstrating preference alignment without any reward model or RL loop, the core appeal of DPO.
Solution
Logging the absolute log-probabilities while deliberately overtraining reproduces Exercise 7's failure: the rejected log-prob plummets while the chosen also declines, the margin still grows, and quality degrades. This shows why monitoring absolute (not just relative) log-probs matters.
Solution
Running DPO at several β values and measuring KL(π‖π_ref) shows smaller β → larger drift (more change from the reference) and larger β → smaller drift — the empirical version of Exercise 6's leash analogy.
Solution
Swapping in IPO's squared-error-to-target loss (Exercise 9) and training on the same data shows IPO resisting the margin runaway and overfitting that DPO exhibits when overtrained — a direct demonstration of the bounded-objective fix.
Solution
KTO uses individually-labeled good/bad responses (no pairs) with a prospect-theory-inspired loss, aligning the model from thumbs-up/down style data. Showing it improves alignment without constructed pairs demonstrates the unpaired-data setting of Exercise 10.
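A simplified per-example form of a KTO-style loss is sketched below. It is an illustration under assumptions, not a reference implementation: `kl_estimate` stands in for the reference point (an estimate of the policy-to-reference KL) and is treated here as a given constant, and the weights `lambda_d` and `lambda_u` default to 1.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(logp_policy, logp_ref, desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for ONE labeled response; no paired comparison is needed."""
    r = logp_policy - logp_ref  # log-ratio of policy to reference
    if desirable:
        # Thumbs-up: loss falls as the log-ratio rises above the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (r - kl_estimate)))
    # Thumbs-down: loss falls as the log-ratio drops below the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (kl_estimate - r)))

# Hypothetical log-probs for single labeled responses.
good_before = kto_loss(-10.0, -12.0, desirable=True, kl_estimate=0.5)
good_after = kto_loss(-9.0, -12.0, desirable=True, kl_estimate=0.5)
bad_before = kto_loss(-10.0, -12.0, desirable=False, kl_estimate=0.5)
bad_after = kto_loss(-11.0, -12.0, desirable=False, kl_estimate=0.5)
print(good_before, good_after, bad_before, bad_after)
```

Raising the likelihood of a thumbs-up response lowers its loss, and lowering the likelihood of a thumbs-down response lowers its loss, each example contributing on its own.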
Solution
If chosen responses are systematically longer, DPO learns that 'longer = better' and the model becomes verbose. Length-balancing the preference data (matching chosen/rejected lengths) shrinks the effect — a concrete demonstration of how dataset artifacts leak into the aligned model and how to mitigate them.
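One simple way to length-balance is to drop pairs whose lengths differ too much. The sketch below is one possible filter, not a prescribed method: the dictionary keys `chosen` and `rejected`, the whitespace-token length measure, and the `max_ratio` threshold are all assumptions.

```python
def n_tokens(text: str) -> int:
    # Whitespace split as a crude length proxy (a real tokenizer would differ).
    return len(text.split())

def length_balance(pairs, max_ratio=1.5):
    """Keep pairs whose chosen/rejected lengths are within max_ratio of each other."""
    kept = []
    for p in pairs:
        lc, lr = n_tokens(p["chosen"]), n_tokens(p["rejected"])
        if lc == 0 or lr == 0:
            continue  # skip empty responses
        if max(lc, lr) / min(lc, lr) <= max_ratio:
            kept.append(p)
    return kept

def mean_length_gap(pairs):
    # Average of (chosen length - rejected length); positive means chosen is longer.
    return sum(n_tokens(p["chosen"]) - n_tokens(p["rejected"]) for p in pairs) / len(pairs)

pairs = [
    {"chosen": "a b c d e f g h", "rejected": "a b"},  # 8 vs 2 tokens: dropped
    {"chosen": "a b c", "rejected": "a b"},            # 3 vs 2 tokens: kept
    {"chosen": "a b", "rejected": "a b c"},            # 2 vs 3 tokens: kept
]
kept = length_balance(pairs)
print(mean_length_gap(pairs), mean_length_gap(kept))
```

Filtering trades data volume for a smaller length bias; comparing the mean length gap before and after shows how much of the artifact was removed.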
Solution
Using the frozen base as BOTH the reference and the LoRA backbone means only the adapter trains while the base serves double duty: reference log-probs come from a forward pass with the adapter disabled, and policy log-probs from a pass with it enabled. The result is a memory-efficient DPO that needs no separate reference copy. Confirming only the adapter receives gradients and reporting the memory savings shows DPO+LoRA is a practical, lightweight alignment recipe.
Solution
Aligning the same SFT model with offline DPO and online GRPO and comparing on a held-out judge shows the offline-vs-online trade-off (Exercise 8): DPO is simpler, faster, and stabler but capped by the dataset; GRPO can exceed it via exploration at higher cost/complexity. A sensible recommendation: DPO for general alignment where good preference data exists, online RL (GRPO) for reasoning or when you can verify/score fresh samples.