Supervised Fine-Tuning
Detailed solutions for the exercises in Chapter 22. Try solving them yourself before checking the answers.
Solution
A base model is trained only to predict the next token over web text, so given 'What is the capital of France?' it may continue with more questions (as a quiz page would) rather than answer. Prompt where they diverge: 'List three tips for a job interview:' — a base model might continue the document (adding more list items unrelated to helping you, or drifting), while an instruction-tuned model directly provides three helpful tips. SFT teaches the model that an instruction should be followed, not merely continued.
Solution
The hypothesis (Zhou et al., LIMA) holds that a model's knowledge and capabilities are learned almost entirely during pretraining, and alignment merely teaches it the FORMAT and STYLE of helpful responses — surfacing existing abilities, not adding new ones. Implication: you need relatively little SFT data (thousands, not millions of examples), but it must be high-quality and diverse, because you are teaching a behavior template; a few excellent examples beat a flood of mediocre ones.
Solution
We compute loss only on the response tokens so the model learns to GENERATE good answers, not to generate the user's instructions. Training on the full sequence (including the prompt) wastes capacity teaching the model to predict user inputs — which it will never need to do at inference — and can bias it toward parroting or completing prompts rather than responding to them. Masking focuses the gradient on the behavior we actually want.
Solution
SFT loss = −Σ_{t ∈ response} log P(x_t | x_{
Solution
The model learned to respond conditioned on EXACT special tokens and formatting (role markers, separators) seen during instruction tuning. A mismatched template puts the model in an out-of-distribution context, degrading its behavior — it may not recognize where the user turn ends and its turn begins. You should use the tokenizer's built-in apply_chat_template rather than hand-building, because the exact tokens, spacing, and special markers are easy to get subtly wrong, and any deviation hurts.
Solution
(1) Memory — full fine-tuning needs optimizer state for all parameters; PEFT trains a tiny fraction, slashing optimizer memory. (2) Storage — each full fine-tune is a full model copy; PEFT stores only small adapters (megabytes) per task. (3) Catastrophic forgetting — updating all weights can erode pretrained abilities; PEFT freezes the base, preserving it. PEFT (e.g. LoRA) gives most of the benefit of fine-tuning at a fraction of the cost and risk.
Solution
LoRA freezes W and learns ΔW = BA with B (d×r) and A (r×d): trainable = dr + rd = 2dr. For d=4096, r=8: 2·4096·8 = 65,536. Full fine-tuning trains d² = 16.78M. Reduction = d²/(2dr) = d/(2r) = 4096/16 = 256× fewer trainable parameters.
Solution
With B=0, the update ΔW = BA = 0 at initialization, so the model's output exactly equals the frozen base model's at step 0 — training starts from the pretrained behavior, not a perturbed one. A is small-random (not zero) so that gradients can flow into it once B begins to move (if both were zero, the symmetric product would have no gradient signal). This makes LoRA a safe, smooth departure from the base model.
Solution
LoRA scales its update by alpha/r: ΔW = (alpha/r)·BA. Dividing by r keeps the update's magnitude roughly constant as you change the rank, so you can tune r without re-tuning the learning rate. alpha sets the overall strength of the adapter; doubling alpha doubles the update's influence on the output (a stronger adaptation), equivalent to scaling the effective learning rate for the adapter.
Solution
Because the pretrained weights are FROZEN, there is no risk of a large step damaging them — only the small, freshly-initialized adapter parameters are updated. Those few parameters can tolerate (and benefit from) aggressive learning rates to adapt quickly, whereas full fine-tuning must use a small rate to avoid disturbing the delicate pretrained representations. Freezing the base de-risks large steps on the adapter.
Solution
QLoRA quantizes the frozen base weights to 4-bit (NF4), cutting their memory ~4× (a 70B model fits in ~35–40 GB), and trains only small bf16 LoRA adapters on top. It is acceptable because the base is frozen — quantization error in fixed weights is tolerable since they aren't being optimized — while the adapters, which DO learn, stay in higher precision (bf16) so their gradients are accurate. Quantize what's static; keep precision where learning happens.
Solution
Set the label of every prompt token to −100 (the ignore index) so the loss skips them, keeping real token IDs only for the response. Verifying that the loss is unchanged when prompt tokens are altered (but changes with response tokens) confirms only the response contributes gradient — Exercise 3 in code.
Solution
Before SFT, prompting the base model with an instruction yields a continuation (more text in the document's style); after fine-tuning on a handful of instruction–response pairs, the same prompt yields a direct answer — a concrete demonstration of the continue-vs-answer distinction of Exercise 1 and the format-teaching role of SFT.
Solution
apply_chat_template inserts the model's exact role markers and special tokens for each turn. Printing them and comparing to a hand-built template (wrong spacing or markers) shows how easily the format diverges — and (Exercise 5) why such divergence degrades responses.
Solution
Walking the conversation and masking every system and user token (and the role markers) while keeping only assistant-response tokens ensures the model trains to produce assistant turns, not to predict user inputs — the multi-turn generalization of prompt masking.
Solution
Assembling tokenization, prompt masking, and the next-token loss into a training loop and fine-tuning on a curated instruction set produces a model that, on held-out prompts, now follows instructions where the base model merely continued — the end-to-end SFT result.
Solution
With B=0 at init, the LoRALinear output equals the frozen base layer's (Exercise 8); checking that gradients flow only to A and B (the base requires_grad=False) confirms parameter-efficient training. The layer adds (alpha/r)·x·Aᵀ·Bᵀ to the frozen projection.
Solution
Applying LoRA (small r) to the query/value projections and calling print_trainable_parameters shows well under 1% of parameters are trainable — the dramatic reduction of Exercise 7 realized on a real model.
Solution
LoRA uses far less peak memory (no optimizer state for the frozen base), trains comparably fast, reaches similar quality on the task, and produces tiny checkpoints (adapters only, megabytes vs gigabytes) — quantifying the PEFT advantages of Exercise 6.
Solution
Folding W' = W + (alpha/r)BA into the base weights produces a standard model with no adapter at inference; verifying its outputs match the unmerged LoRA model confirms merging is exact and yields zero inference overhead — a key deployment convenience.
Solution
Loading the base in 4-bit NF4 and attaching bf16 LoRA adapters lets a model that exceeds GPU memory in bf16 be fine-tuned; reporting the memory before/after shows the ~4× reduction in base-weight memory that makes single-GPU fine-tuning of large models possible (Exercise 11).
Solution
Curating ~500 high-quality pairs, LoRA-fine-tuning, and judging on held-out prompts gives a strong instruction-follower. Injecting 200 low-quality examples visibly degrades response quality — demonstrating that SFT teaches behavior/format and that DATA QUALITY, not quantity, dominates (Exercise 2). A little bad data poisons the behavior template.