Solutions Appendix
Chapter 22

Supervised Fine-Tuning

22 Solutions

Detailed solutions for the exercises in Chapter 22. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Why does a base model 'continue' rather than 'answer'? Give a divergent prompt.

Solution

A base model is trained only to predict the next token over web text, so given 'What is the capital of France?' it may continue with more questions (as a quiz page would) rather than answer. Prompt where they diverge: 'List three tips for a job interview:' — a base model might continue the document (adding more list items unrelated to helping you, or drifting), while an instruction-tuned model directly provides three helpful tips. SFT teaches the model that an instruction should be followed, not merely continued.

Exercise 2Pen & Paper
State the superficial alignment hypothesis; implications for SFT data quantity and quality.

Solution

The hypothesis (Zhou et al., LIMA) holds that a model's knowledge and capabilities are learned almost entirely during pretraining, and alignment merely teaches it the FORMAT and STYLE of helpful responses — surfacing existing abilities, not adding new ones. Implication: you need relatively little SFT data (thousands, not millions of examples), but it must be high-quality and diverse, because you are teaching a behavior template; a few excellent examples beat a flood of mediocre ones.

Exercise 3Pen & Paper
Why mask prompt tokens in the SFT loss? What goes wrong if you train on the full sequence?

Solution

We compute loss only on the response tokens so the model learns to GENERATE good answers, not to generate the user's instructions. Training on the full sequence (including the prompt) wastes capacity teaching the model to predict user inputs — which it will never need to do at inference — and can bias it toward parroting or completing prompts rather than responding to them. Masking focuses the gradient on the behavior we actually want.

Exercise 4Pen & Paper
Write the SFT loss; how does it differ from pretraining loss? What stays the same?

Solution

SFT loss = −Σ_{t ∈ response} log P(x_t | x_{

Exercise 5Pen & Paper
Why does the wrong chat template degrade responses? Why never hand-build it?

Solution

The model learned to respond conditioned on EXACT special tokens and formatting (role markers, separators) seen during instruction tuning. A mismatched template puts the model in an out-of-distribution context, degrading its behavior — it may not recognize where the user turn ends and its turn begins. You should use the tokenizer's built-in apply_chat_template rather than hand-building, because the exact tokens, spacing, and special markers are easy to get subtly wrong, and any deviation hurts.

Exercise 6Pen & Paper
Three problems with full fine-tuning and how PEFT addresses each.

Solution

(1) Memory — full fine-tuning needs optimizer state for all parameters; PEFT trains a tiny fraction, slashing optimizer memory. (2) Storage — each full fine-tune is a full model copy; PEFT stores only small adapters (megabytes) per task. (3) Catastrophic forgetting — updating all weights can erode pretrained abilities; PEFT freezes the base, preserving it. PEFT (e.g. LoRA) gives most of the benefit of fine-tuning at a fraction of the cost and risk.

Exercise 7Derive
Derive LoRA's 2dr trainable params for a d×d weight; for d=4096, r=8, the reduction factor.

Solution

LoRA freezes W and learns ΔW = BA with B (d×r) and A (r×d): trainable = dr + rd = 2dr. For d=4096, r=8: 2·4096·8 = 65,536. Full fine-tuning trains d² = 16.78M. Reduction = d²/(2dr) = d/(2r) = 4096/16 = 256× fewer trainable parameters.

Exercise 8Pen & Paper
Why initialize LoRA B=0 and A small-random? What does the model compute at the start?

Solution

With B=0, the update ΔW = BA = 0 at initialization, so the model's output exactly equals the frozen base model's at step 0 — training starts from the pretrained behavior, not a perturbed one. A is small-random (not zero) so that gradients can flow into it once B begins to move (if both were zero, the symmetric product would have no gradient signal). This makes LoRA a safe, smooth departure from the base model.

Exercise 9Pen & Paper
Explain LoRA's alpha/r scaling; why divide by r; effect of doubling alpha.

Solution

LoRA scales its update by alpha/r: ΔW = (alpha/r)·BA. Dividing by r keeps the update's magnitude roughly constant as you change the rank, so you can tune r without re-tuning the learning rate. alpha sets the overall strength of the adapter; doubling alpha doubles the update's influence on the output (a stronger adaptation), equivalent to scaling the effective learning rate for the adapter.

Exercise 10Pen & Paper
Why can LoRA use a much higher learning rate than full fine-tuning?

Solution

Because the pretrained weights are FROZEN, there is no risk of a large step damaging them — only the small, freshly-initialized adapter parameters are updated. Those few parameters can tolerate (and benefit from) aggressive learning rates to adapt quickly, whereas full fine-tuning must use a small rate to avoid disturbing the delicate pretrained representations. Freezing the base de-risks large steps on the adapter.

Exercise 11Pen & Paper
How does QLoRA fit a 70B model on one GPU? Why is 4-bit base + bf16 adapters acceptable?

Solution

QLoRA quantizes the frozen base weights to 4-bit (NF4), cutting their memory ~4× (a 70B model fits in ~35–40 GB), and trains only small bf16 LoRA adapters on top. It is acceptable because the base is frozen — quantization error in fixed weights is tolerable since they aren't being optimized — while the adapters, which DO learn, stay in higher precision (bf16) so their gradients are accurate. Quantize what's static; keep precision where learning happens.

Exercise 12Code
Implement prepare_sft_example with prompt masking (labels=−100 on prompt); verify only response contributes.

Solution

Set the label of every prompt token to −100 (the ignore index) so the loss skips them, keeping real token IDs only for the response. Verifying that the loss is unchanged when prompt tokens are altered (but changes with response tokens) confirms only the response contributes gradient — Exercise 3 in code.

Exercise 13Code
Show a base model continuing, then fine-tune so it answers.

Solution

Before SFT, prompting the base model with an instruction yields a continuation (more text in the document's style); after fine-tuning on a handful of instruction–response pairs, the same prompt yields a direct answer — a concrete demonstration of the continue-vs-answer distinction of Exercise 1 and the format-teaching role of SFT.

Exercise 14Code
Use apply_chat_template on a multi-turn conversation; contrast a wrong hand-built template.

Solution

apply_chat_template inserts the model's exact role markers and special tokens for each turn. Printing them and comparing to a hand-built template (wrong spacing or markers) shows how easily the format diverges — and (Exercise 5) why such divergence degrades responses.

Exercise 15Code
Implement multi-turn loss masking: −100 on system/user tokens, keep assistant tokens.

Solution

Walking the conversation and masking every system and user token (and the role markers) while keeping only assistant-response tokens ensures the model trains to produce assistant turns, not to predict user inputs — the multi-turn generalization of prompt masking.

Exercise 16Code Lab
Build the full SFT loop; fine-tune a small base model; show before/after on held-out prompts.

Solution

Assembling tokenization, prompt masking, and the next-token loss into a training loop and fine-tuning on a curated instruction set produces a model that, on held-out prompts, now follows instructions where the base model merely continued — the end-to-end SFT result.

Exercise 17Code
Implement LoRALinear; verify output = base at init and only A,B get gradients.

Solution

With B=0 at init, the LoRALinear output equals the frozen base layer's (Exercise 8); checking that gradients flow only to A and B (the base requires_grad=False) confirms parameter-efficient training. The layer adds (alpha/r)·x·Aᵀ·Bᵀ to the frozen projection.

Exercise 18Code
Wrap attention projections with LoRA; confirm <1% of parameters train.

Solution

Applying LoRA (small r) to the query/value projections and calling print_trainable_parameters shows well under 1% of parameters are trainable — the dramatic reduction of Exercise 7 realized on a real model.

Exercise 19Code Lab
Full fine-tuning vs LoRA: compare memory, speed, quality, checkpoint size.

Solution

LoRA uses far less peak memory (no optimizer state for the frozen base), trains comparably fast, reaches similar quality on the task, and produces tiny checkpoints (adapters only, megabytes vs gigabytes) — quantifying the PEFT advantages of Exercise 6.

Exercise 20Code
Demonstrate LoRA merging: merge BA into base weights; verify identical outputs.

Solution

Folding W' = W + (alpha/r)BA into the base weights produces a standard model with no adapter at inference; verifying its outputs match the unmerged LoRA model confirms merging is exact and yields zero inference overhead — a key deployment convenience.

Exercise 21Code Lab
Set up QLoRA (4-bit base via bitsandbytes + PEFT); fine-tune a model that wouldn't fit in bf16; report savings.

Solution

Loading the base in 4-bit NF4 and attaching bf16 LoRA adapters lets a model that exceeds GPU memory in bf16 be fine-tuned; reporting the memory before/after shows the ~4× reduction in base-weight memory that makes single-GPU fine-tuning of large models possible (Exercise 11).

Exercise 22Code (Challenge)
Full SFT pipeline; then add 200 low-quality examples and show degradation — demonstrating the superficial alignment hypothesis.

Solution

Curating ~500 high-quality pairs, LoRA-fine-tuning, and judging on held-out prompts gives a strong instruction-follower. Injecting 200 low-quality examples visibly degrades response quality — demonstrating that SFT teaches behavior/format and that DATA QUALITY, not quantity, dominates (Exercise 2). A little bad data poisons the behavior template.