Supervised Fine-Tuning
At the end of Part IV you had a fully pretrained base model. It is a marvel: trained on trillions of tokens, it can complete text with uncanny fluency, has absorbed a vast amount of world knowledge, and predicts the next token better than any system before it. And yet, if you tried to USE it as a chatbot, you would be disappointed. This chapter explains why — and how to fix it.
The reason is simple once you see it. A base model was trained on one objective and one objective only: predict the next token in web text. It is a text-completion engine. It has never been taught to answer questions, follow instructions, or hold a conversation. It only knows how to continue whatever text it is given, in the style of its training data.
What a Base Model Actually Does
Suppose you give a base model the prompt 'What is the capital of France?'. You might expect 'Paris'. But the base model does not 'answer' — it CONTINUES. On the web, a line like that is often followed by more quiz questions, so the base model might continue with another question instead of answering:
Neither response is 'wrong' as text completion — the base model is doing exactly what it was trained to do. The problem is a MISMATCH between what it was trained to do (continue text) and what we want it to do (helpfully answer). Supervised fine-tuning is how we close that gap: we take the capable base model and teach it the NEW behaviour of being a helpful assistant.
The Two-Stage Recipe
Modern assistant models are built in two big stages. First, PRETRAINING (Parts III–IV) gives the model its raw capabilities and knowledge — this is the expensive part, costing millions of dollars. Then, POST-TRAINING (this Part V) shapes that raw capability into helpful, harmless, honest behaviour — this is comparatively cheap, but it is what makes the model usable.
Pipeline Flow: The path from raw text to helpful assistant
| 1 | Pretraining | Predict next token on trillions of web tokens → a capable but raw base model (Parts III–IV) |
| 2 | SFT | Fine-tune on instruction-response demonstrations → a model that follows instructions (this chapter) |
| 3 | Preference tuning | RLHF or DPO aligns the model with human preferences (Chapters 23–24) |
| 4 | Safety tuning | Constitutional methods make it harmless and honest (Chapter 26) |
Instruction tuning — the most common form of SFT — is conceptually the simplest idea in this entire book. We show the model many examples of instructions paired with good responses, and we train it to produce those responses. That is it. The model learns, by imitation, to respond helpfully to instructions it has never seen.
The Data: Instruction-Response Pairs
The training data for instruction tuning is a collection of (instruction, response) pairs. Each pair is a demonstration of the behaviour we want. Here are a few examples of what such pairs look like:
| Instruction (input) | Response (target) |
|---|---|
| Summarize this in one sentence: [long text] | A concise one-sentence summary of the text. |
| Translate 'hello' into French. | Bonjour. |
| Write a haiku about autumn. | Leaves drift to the ground / crimson and gold in the breeze / autumn whispers low |
| Explain photosynthesis to a 5-year-old. | Plants eat sunlight! They use light, water, and air to make their own food... |
| Fix the bug in this code: [code] | The bug is on line 3. Here is the corrected version: [fixed code] |
Notice the diversity: summarization, translation, creative writing, explanation, coding. The model is not trained on one task — it is trained on MANY tasks, all framed as 'here is an instruction, here is a good response'. This diversity is the key. By seeing thousands of different instructions, the model learns the GENERAL skill of 'follow the instruction helpfully', which then transfers to brand-new instructions at test time.
Where Does Instruction Data Come From?
Early instruction-tuning datasets were built by converting existing NLP datasets into instruction format (the FLAN and T0 approach). Modern datasets use a mix of sources: human-written demonstrations (expensive but high-quality), examples distilled from a stronger model (cheap and scalable), and curated, filtered collections. We cover data curation in detail in Section 22.5; for now, the key point is that the data is a set of demonstrations of good behaviour.
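To make the "convert an existing dataset" idea concrete, here is a minimal sketch of turning labelled classification examples into instruction-response pairs. The templates, field names, and the `to_instruction_pair` helper are illustrative assumptions, not the actual FLAN or T0 templates.

```python
import random

# Hypothetical instruction templates for a sentiment-classification dataset.
# Using several phrasings per task adds diversity to the instructions.
TEMPLATES = [
    "Is the sentiment of this review positive or negative?\n\n{text}",
    "Review: {text}\n\nClassify the sentiment as positive or negative.",
]

def to_instruction_pair(example, rng):
    """Turn one labelled example into an (instruction, response) pair."""
    template = rng.choice(TEMPLATES)
    return {
        "instruction": template.format(text=example["text"]),
        "response": example["label"],
    }

rng = random.Random(0)  # seeded so the sketch is reproducible
labelled = [
    {"text": "A delightful, funny film.", "label": "positive"},
    {"text": "Two hours I will never get back.", "label": "negative"},
]
pairs = [to_instruction_pair(ex, rng) for ex in labelled]
```

Each resulting pair has the same shape as the rows in the table above, so converted data can be mixed freely with human-written or distilled demonstrations.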
Here is good news for the beginner: the SFT training objective is the SAME next-token cross-entropy loss you already know from Chapter 15. We are not learning a new kind of training. We take the instruction-response pairs, format them into sequences, and train the model to predict each next token — exactly as in pretraining. The model architecture does not change at all.
There is just ONE important twist, and understanding it is essential: we usually do NOT want the model to be penalized for failing to predict the INSTRUCTION tokens — only the RESPONSE tokens. We mask out the loss on the prompt. Let us build up to why.
Formatting a Training Example
First, we concatenate the instruction and response into a single sequence, with special markers showing where each begins (more on these markers in Section 22.4). Schematically, one training example becomes:
[INSTRUCTION tokens] [RESPONSE tokens] [END]
Example (simplified):
<user> What is 2+2? <assistant> 2+2 equals 4. <end>
\_____ prompt _____/ \___ response ___/
Why Mask the Prompt?
During pretraining, the model predicts EVERY token. But in SFT, the instruction is GIVEN — it is the input, not something the model should learn to generate. If we trained the model to predict the instruction tokens too, we would waste capacity teaching it to generate instructions (which we never want it to do) instead of focusing on generating good responses. So we apply a LOSS MASK: we compute the loss only on the response tokens and ignore the prompt tokens.
import torch

def prepare_sft_example(prompt, response, tokenizer):
    """Tokenize a prompt+response pair and build masked labels."""
    prompt_ids = tokenizer.encode(prompt)      # the instruction
    response_ids = tokenizer.encode(response)  # the target answer
    # The model sees the full sequence: prompt followed by response
    input_ids = prompt_ids + response_ids
    # Labels: -100 means 'ignore in the loss'. We mask the prompt so
    # the model is only trained to predict the RESPONSE tokens.
    labels = ([-100] * len(prompt_ids)) + response_ids
    return torch.tensor(input_ids), torch.tensor(labels)

# Example:
#   prompt   = '<user> What is 2+2? <assistant>'  -> labels all -100
#   response = ' 2+2 equals 4. <end>'             -> labels = the token ids
# PyTorch's cross_entropy(ignore_index=-100) skips the -100 positions,
# so the loss comes only from the response. That's the whole twist.
The Loss
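A quick numerical check makes the masking concrete. The sketch below is plain Python with made-up probabilities; the function name `masked_nll` and all numbers are illustrative assumptions, not part of any library. It mimics the default averaging that cross_entropy performs with ignore_index=-100, where masked positions drop out of both the sum and the count.

```python
import math

IGNORE_INDEX = -100  # same sentinel the chapter's labels use

def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[t] is the probability the model assigned to the correct
    token at position t (toy numbers here, not real model outputs).
    """
    kept = [p for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return -sum(math.log(p) for p in kept) / len(kept)

# Three prompt positions (masked) followed by two response positions.
token_probs = [0.01, 0.02, 0.05, 0.5, 0.25]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 17, 42]

loss = masked_nll(token_probs, labels)

# Changing how well the model predicts the PROMPT leaves the loss unchanged.
loss_better_prompt = masked_nll([0.9, 0.9, 0.9, 0.5, 0.25], labels)
```

Only the two response positions contribute, so the loss is the average of -log(0.5) and -log(0.25) regardless of the prompt probabilities.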
L = -(1/|R|) Σ_{t ∈ R} log P(x_t | x_{<t})
R = the set of RESPONSE token positions only
# Identical to pretraining's loss, but averaged over response tokens, not all tokens.
In Section 22.3 we used informal markers like '<user>' and '<assistant>'. Real models rely on exact special tokens, defined by a chat template.
Why Templates Exist
A conversation has structure: there are turns, and each turn has a ROLE — system (instructions about how to behave), user (the human), or assistant (the model). The model needs to know where each turn begins and ends, and who is speaking. Special tokens mark these boundaries. The model is TRAINED with these exact tokens, so at inference time you must use the SAME tokens, or the model will be confused.
A Concrete Template: ChatML
One widely-used format is ChatML, which wraps each turn in '<|im_start|>' and '<|im_end|>' tokens with the role name. Here is a full conversation in ChatML format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
# Each turn: <|im_start|>{role}\n{content}<|im_end|>
# The model learns: after '<|im_start|>assistant\n', generate a response,
# then emit '<|im_end|>' to signal it is finished.
# Different model families use DIFFERENT templates -- LLaMA, Mistral, and
# Gemma each have their own. There is no universal standard.
Different Models, Different Templates
There is no single standard chat template — each model family invented its own. LLaMA-2 uses '[INST]' and '[/INST]' markers; LLaMA-3 uses a different header-based format; Mistral, Gemma, and others each differ. The content is the same conversation; only the wrapping tokens differ. This is why you must use the template that MATCHES the model you are fine-tuning or running.
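To see what a template actually does, here is a minimal hand-written renderer for the ChatML layout described above. It is a sketch for illustration only: `render_chatml` is a hypothetical helper, and real ChatML models may differ in details such as default system prompts, so production code should rely on the tokenizer's own template.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts as a ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
text = render_chatml(messages, add_generation_prompt=True)
```

Swapping this function for one that emits another family's markers would change only the wrapping, not the conversation, which is exactly why a mismatch between training and inference templates is so easy to introduce.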
Always format conversations with the tokenizer's apply_chat_template method, which knows the exact format for that model. This single habit prevents a whole class of frustrating, hard-to-diagnose quality problems.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')
messages = [
    # Some templates do not accept a 'system' role, so this example
    # uses only user and assistant turns.
    {'role': 'user', 'content': 'Hello!'},
    {'role': 'assistant', 'content': 'Hi! How can I help?'},
    {'role': 'user', 'content': 'Tell me a joke.'},
]
# For TRAINING: tokenize the full conversation
input_ids = tok.apply_chat_template(messages, tokenize=True)
# For INFERENCE: add the generation prompt so the model continues
# as the assistant (in ChatML this appends '<|im_start|>assistant';
# some templates, such as [INST]-style ones, need no extra opener)
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True)
# The tokenizer handles the exact special tokens, spacing, and newlines.
# You never need to remember the format -- the tokenizer knows it.
If SFT is just training on demonstrations, then the demonstrations are everything. The single most important factor in SFT success is DATA QUALITY. And the most surprising, important finding in this area is that you need far LESS data than you might expect, but it must be GOOD.
The Superficial Alignment Hypothesis
The LIMA paper (Zhou et al., 2023) made a striking claim, backed by experiment: a model fine-tuned on just 1,000 carefully-curated, high-quality examples could rival models trained on tens of thousands. They called the explanation the 'Superficial Alignment Hypothesis': a model's knowledge and capabilities are learned almost entirely during pretraining; SFT mostly teaches the model which of its existing abilities and FORMATS to use when responding.
What Makes SFT Data Good?
| Quality dimension | What to aim for |
|---|---|
| Correctness | Responses are accurate, complete, and genuinely helpful |
| Diversity | Many different tasks, topics, formats, and difficulty levels |
| Format consistency | Responses follow a consistent, clean style and structure |
| Appropriate length | Detailed enough to help, not padded with fluff |
| Helpful tone | Polite, direct, well-organized — the behaviour you want |
| No contradictions | The dataset does not teach conflicting behaviours |
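Some of these quality dimensions can be screened mechanically before any human review. The following is a minimal sketch of such a filter; the function name, field names, and length thresholds are assumptions chosen for illustration, and heuristics like these catch only mechanical problems (empty, oversized, or duplicated examples), not incorrect or unhelpful answers.

```python
def clean_sft_pairs(pairs, min_response_chars=1, max_response_chars=4000):
    """Drop empty, out-of-range, and duplicate-instruction examples."""
    seen = set()
    kept = []
    for pair in pairs:
        instruction = pair["instruction"].strip()
        response = pair["response"].strip()
        # Skip examples with nothing to learn from.
        if not instruction or not response:
            continue
        # Skip responses outside the allowed length range.
        if not (min_response_chars <= len(response) <= max_response_chars):
            continue
        # Keep only the first example for each (case-insensitive) instruction,
        # which also avoids two conflicting answers to the same instruction.
        key = instruction.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "response": response})
    return kept

raw = [
    {"instruction": "Translate 'hello' into French.", "response": "Bonjour."},
    {"instruction": "translate 'hello' into french.", "response": "Salut!"},
    {"instruction": "Write a haiku about autumn.", "response": "   "},
]
cleaned = clean_sft_pairs(raw)
```

Here the duplicate instruction and the blank response are removed, leaving a single clean pair.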
Sources of SFT Data
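There are three main ways to obtain instruction data, each with trade-offs: human-written demonstrations (expensive but high-quality), examples distilled from a stronger model (cheap and scalable), and curated, filtered collections of existing data.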
age = 25, the variable age holds the value 25.
The two responses above answer the same instruction, but the left (chosen) one is the kind of demonstration you want in your SFT set: accurate, complete, well-structured, with a concrete example. The right (rejected) one is vague, lowercase, and unhelpful. Training on responses like the right one teaches the model to be vague and unhelpful. This is why curation (actively keeping the good and discarding the bad) is the heart of SFT data work.
We now have all the pieces: the data (instruction-response pairs), the format (chat template), and the objective (masked next-token loss). Let us assemble them into a complete, working SFT training loop. We will build it step by step so every line is clear.
Step 1: Prepare the Dataset
import torch
from torch.utils.data import Dataset

class SFTDataset(Dataset):
    def __init__(self, conversations, tokenizer, max_len=2048):
        self.examples = []
        for conv in conversations:
            # Tokenize the full conversation with the chat template
            ids = tokenizer.apply_chat_template(conv, tokenize=True)
            labels = self._mask_non_assistant(conv, ids, tokenizer)
            self.examples.append((ids[:max_len], labels[:max_len]))

    def _mask_non_assistant(self, conv, ids, tok):
        """Set labels to -100 everywhere except assistant responses."""
        labels = [-100] * len(ids)
        for i, msg in enumerate(conv):
            if msg['role'] != 'assistant':
                continue
            # One simple way to find the assistant span. It ASSUMES the
            # template is prefix-stable: rendering the first i turns gives
            # a prefix of rendering the first i+1 turns.
            start = len(tok.apply_chat_template(
                conv[:i], tokenize=True, add_generation_prompt=True))
            end = len(tok.apply_chat_template(conv[:i + 1], tokenize=True))
            labels[start:end] = ids[start:end]  # unmask this response
        return labels

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return self.examples[i]
Step 2: The Training Loop
The training loop is almost identical to the pretraining loop of Chapter 15. The only differences are the masked labels and the much smaller dataset and learning rate. Notice how familiar this looks — SFT really is just focused fine-tuning.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def pad_collate(batch):
    """Right-pad a batch; padded label positions are -100 (ignored)."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids = torch.zeros(len(batch), max_len, dtype=torch.long)
    labels = torch.full((len(batch), max_len), -100, dtype=torch.long)
    for i, (ids, lab) in enumerate(batch):
        input_ids[i, :len(ids)] = torch.tensor(ids)
        labels[i, :len(lab)] = torch.tensor(lab)
    return input_ids, labels

# 1. Load the PRETRAINED base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained('base-model').cuda()
tok = AutoTokenizer.from_pretrained('base-model')
# 2. Build the SFT dataset and loader
#    (conversations: your list of chat-format conversations)
dataset = SFTDataset(conversations, tok)
loader = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=pad_collate)
# 3. SFT uses a SMALL learning rate -- we are gently nudging, not
#    retraining. Too high and we destroy pretrained knowledge.
opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
# 4. Train for just 1-3 epochs -- more causes overfitting
model.train()
for epoch in range(3):
    for input_ids, labels in loader:
        input_ids, labels = input_ids.cuda(), labels.cuda()
        with torch.autocast('cuda', dtype=torch.bfloat16):
            # Right padding plus causal attention means the padded
            # positions cannot influence the real tokens before them.
            logits = model(input_ids).logits
            # Shift for next-token prediction, then masked cross-entropy
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100,  # <- the masking happens here
            )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        opt.zero_grad()
# That's it. The model is now an instruction-follower.
The training loop in Section 22.6 does FULL fine-tuning: it updates every one of the model's billions of parameters. This works, but it has serious practical problems that motivate the parameter-efficient methods in the rest of the chapter.
Problem 1: Memory Cost
Recall the memory budget from Chapter 18: full fine-tuning with AdamW in mixed precision needs about 16 bytes per parameter (2 for the bf16 weights, 2 for the gradients, 4 for the fp32 master copy, and 4 each for the two optimizer moments). For a 7-billion-parameter model, that is about 112GB, more than a single 80GB GPU can hold. To fully fine-tune even a modest model, you need multiple expensive GPUs and the distributed-training machinery of Chapter 18.
Memory ≈ 16 bytes × N parameters (+ activations)
7B model: 16 × 7e9 = 112 GB (exceeds one 80GB GPU)
70B model: 16 × 70e9 = 1,120 GB (needs many GPUs)
Most people cannot afford this just to fine-tune.
Problem 2: Storage Cost
Full fine-tuning produces a complete copy of the model for each task. If you fine-tune a 7B model for ten different tasks, you have ten 14GB checkpoints — 140GB of storage, and you must load a whole new model to switch tasks. This does not scale when you want many specialized variants.
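The arithmetic behind Problems 1 and 2 is simple enough to script. This sketch uses the chapter's figures (16 bytes per parameter for training state, 2 bytes per parameter for a bf16 checkpoint); the function names are illustrative, and activation memory is deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the chapter's arithmetic

def full_finetune_memory_gb(n_params, bytes_per_param=16):
    """Weights, gradients, master copy, and AdamW moments (no activations)."""
    return n_params * bytes_per_param / GB

def checkpoint_storage_gb(n_params, n_tasks, bytes_per_param=2):
    """Storage for one full bf16 checkpoint per fine-tuned task."""
    return n_params * bytes_per_param * n_tasks / GB

mem_7b = full_finetune_memory_gb(7e9)     # training state for a 7B model
mem_70b = full_finetune_memory_gb(70e9)   # training state for a 70B model
storage_10_tasks = checkpoint_storage_gb(7e9, n_tasks=10)

for name, value in [("7B training", mem_7b),
                    ("70B training", mem_70b),
                    ("ten 7B checkpoints", storage_10_tasks)]:
    print(f"{name}: {value:,.0f} GB")
```

Plugging in your own model size and precision gives a quick lower bound before you commit to hardware.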
Problem 3: Catastrophic Forgetting
When you update all parameters on a narrow SFT dataset, the model can 'forget' some of its pretrained knowledge — a phenomenon called catastrophic forgetting. The aggressive updates that teach the new behaviour can overwrite capabilities the model had before. Conservative learning rates help, but the risk grows with the amount of fine-tuning.
LoRA (Low-Rank Adaptation; Hu et al., 2021) is the most important parameter-efficient fine-tuning method. It is elegant, effective, and widely used. We met it briefly in Chapter 20; here we develop it carefully and from the ground up, because understanding it well is essential for modern fine-tuning.
The Key Observation
Full fine-tuning learns an update ΔW to each weight matrix W, giving a new weight W + ΔW. The update ΔW has the same shape as W — for a 4096×4096 matrix, that is ~16.8 million numbers per matrix. LoRA's insight: this update ΔW tends to have LOW RANK. That means it can be FACTORED into the product of two much smaller matrices, capturing almost the same change with far fewer numbers.
Full fine-tuning: W_new = W + ΔW (ΔW is d×d, large)
LoRA: ΔW = B A (low-rank factorization)
B is d×r, A is r×d, with rank r ≪ d
W_new = W + B A
Trainable params: 2·d·r instead of d²
Let us make the savings concrete. For a 4096×4096 weight matrix (d = 4096) with LoRA rank r = 8: full fine-tuning trains 4096×4096 ≈ 16.8 million parameters, while LoRA trains 2×4096×8 = 65,536 parameters, a 256× reduction. And because the original W is FROZEN, we only need to store and optimize those 65k LoRA parameters per matrix, not the whole model.
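The same count can be reproduced for any shape and rank. A small sketch (the helper name is illustrative) that also shows how the saving shrinks as the rank grows:

```python
def lora_param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in            # every entry of the d_out x d_in update
    lora = r * d_in + d_out * r    # A is r x d_in, B is d_out x r
    return full, lora

full, lora = lora_param_counts(4096, 4096, r=8)
reduction = full / lora

# The reduction factor for a square d x d matrix is d / (2r).
for r in (4, 8, 16, 64):
    f, l = lora_param_counts(4096, 4096, r)
    print(f"r={r:>2}: {l:>9,} LoRA params, {f / l:.0f}x fewer than full")
```

For square matrices the reduction is d/(2r), so doubling the rank halves the saving.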
How LoRA Works in the Forward Pass
During the forward pass, instead of computing y = Wx, LoRA computes y = Wx + B(Ax). The frozen W does its usual work, and the small low-rank path B(Ax) adds the learned adjustment. Crucially, only A and B receive gradients; W never changes. Here is the structure:
Arch Stack: LoRA: a frozen weight with a small trainable side-path
| output y = Wx + BAx | (d,) |
| + add the two paths | |
| B (d×r, trainable) | up-project r→d |
| A (r×d, trainable) | down-project d→r |
| W (d×d, FROZEN) | the pretrained weight |
| input x | (d,) |
The Two Hyperparameters: Rank r and Alpha
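LoRA has two main knobs that beginners must understand: the rank r, which sets the size of the low-rank factors (and therefore the number of trainable parameters), and alpha (α), which scales how strongly the update is applied: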
y = W x + (α / r) · B A x
α/r is the scaling factor:
larger α → the LoRA update has more influence
dividing by r keeps the scale stable as you change rank
Initialization: Starting From the Original Model
A subtle but important detail: LoRA initializes B to ZERO and A to small random values. This means that at the start of training, BA = 0, so the model behaves EXACTLY like the original pretrained model. Training then gradually grows the update from zero. This is why LoRA fine-tuning is stable: it begins as the unchanged base model and departs from it smoothly.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base_linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False  # FREEZE the original weight
        d_in = base_linear.in_features
        d_out = base_linear.out_features
        self.r = r
        self.scaling = alpha / r  # the alpha/r scaling
        # A: small random (down-projection), B: ZERO (up-projection)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # B=0 -> starts as base model

    def forward(self, x):
        # frozen base path + scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B have requires_grad=True, so the optimizer updates
# just 2*d*r parameters per layer. The base model is untouched.
# At init B=0, so output = base(x) exactly -- training departs smoothly.
Merging: Zero Inference Cost
A wonderful property of LoRA: after training, you can MERGE the update back into the original weights by computing W + (α/r)BA once, producing a normal weight matrix. The merged model has the EXACT same architecture and inference speed as the original — the LoRA path adds zero inference cost. You can also keep the adapter separate and swap different LoRA adapters in and out to switch tasks instantly, since each adapter is just a few megabytes.
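The merge is just the distributive law, and a tiny example confirms it. The sketch below uses plain Python lists and made-up 3×3 numbers (all values and helper names are illustrative) to check that the two-path output Wx + (α/r)·B(Ax) equals the output of the single merged matrix W + (α/r)·BA.

```python
def matvec(M, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

d, r, alpha = 3, 1, 2
scaling = alpha / r
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen weight, d x d
A = [[0.5, -1.0, 0.25]]                                  # down-projection, r x d
B = [[1.0], [0.0], [-2.0]]                               # up-projection, d x r
x = [1.0, 2.0, 4.0]

# Unmerged: frozen path plus the scaled low-rank side-path.
side = matvec(B, matvec(A, x))
y_unmerged = [w + scaling * s for w, s in zip(matvec(W, x), side)]

# Merged: fold the update into the weight once, then use a single matmul.
BA = matmul(B, A)
W_merged = [[w + scaling * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_merged, x)
```

Both paths give the same vector, which is why a merged model needs no extra computation at inference time.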
LoRA dramatically reduces the TRAINABLE parameters, but you still need to hold the full frozen base model in memory for the forward and backward passes. For a 70B model in bf16, that frozen model alone is ~140GB — still too large for a single GPU. QLoRA (Dettmers et al., 2023) solves this with a clever combination: store the frozen base model in 4-BIT precision, while training the LoRA adapters in higher precision.
The QLoRA Idea
Quantization (covered fully in Chapter 27) stores numbers with fewer bits. QLoRA quantizes the frozen base model to just 4 bits per parameter — a 4× reduction from bf16's 16 bits. Since the base model is frozen and never updated, the precision loss from quantizing it is tolerable. The LoRA adapters, which ARE trained, stay in higher precision (bf16) so their gradients are accurate. The result: a 70B model fine-tunable on a single 48GB GPU.
Frozen base (4-bit): 0.5 bytes/param
70B model: 0.5 × 70e9 = 35 GB (fits on one 48GB GPU!)
LoRA adapters (bf16): tiny (a few hundred MB)
Optimizer state: only for the small adapters, not the base
vs full fine-tuning: ~1,120 GB. QLoRA: ~40 GB total.
Three Innovations in QLoRA
| Innovation | What it does |
|---|---|
| 4-bit NormalFloat (NF4) | A 4-bit data type optimized for the bell-curve distribution of neural network weights, more accurate than naive 4-bit |
| Double quantization | Quantizes the quantization constants too, saving a little more memory |
| Paged optimizers | Offloads optimizer state to CPU memory during spikes, preventing out-of-memory crashes |
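To give a feel for what "naive 4-bit" means in the table, here is a minimal sketch of block-wise absmax quantization with a uniform grid. This is NOT NF4: NF4 replaces the evenly spaced levels with levels suited to normally distributed weights. The function names and sample values are illustrative assumptions.

```python
def quantize_block(values, bits=4):
    """Map a block of floats to signed integer codes plus one scale."""
    levels = 2 ** (bits - 1) - 1                 # 7 for 4-bit: codes in [-7, 7]
    scale = max(abs(v) for v in values) / levels or 1.0  # avoid scale 0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from codes and the block's scale."""
    return [c * scale for c in codes]

weights = [0.12, -0.40, 0.03, 0.70, -0.06, 0.21]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - q) for w, q in zip(weights, restored))
```

Each weight is stored as a small integer code, and the per-block scale is the "quantization constant" that double quantization compresses further. The rounding error per weight is at most half of one scale step.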
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Configure 4-bit (NF4) quantization for the FROZEN base model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',              # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,         # double quantization
)
model = AutoModelForCausalLM.from_pretrained('big-model', quantization_config=bnb)
# 2. Configure LoRA: which layers, what rank
lora = LoraConfig(
    r=16,                # rank
    lora_alpha=32,       # alpha (= 2r here)
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # attention projs
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Prints the trainable vs. total parameter counts. For a 7B model with
# 32 layers and d = 4096, r=16 on four d×d projections gives
# 32 × 4 × 2·d·r ≈ 16.8M trainable parameters, about 0.24% of the total.
# 3. Train EXACTLY like Section 22.6 -- the loop is unchanged.
# Only a fraction of a percent of parameters are trained, on a 4-bit base, on one GPU.
LoRA is the most popular PEFT method, but it is part of a broader family. Understanding the alternatives helps you appreciate why LoRA won and when another method might fit better. All share the same goal: adapt a frozen model by training only a small number of new parameters.
| Method | How it adapts | Notes |
|---|---|---|
| LoRA | Low-rank update added to weights | Most popular; mergeable, no inference cost |
| QLoRA | LoRA + 4-bit frozen base | Fits big models on one GPU |
| Adapters | Small new layers inserted in blocks | Adds inference latency |
| Prefix tuning | Trainable 'virtual tokens' prepended | Prepends to keys/values |
| Prompt tuning | Trainable soft-prompt embeddings | Simplest; weakest for hard tasks |
| IA3 | Learned scaling vectors for activations | Very few parameters |
| DoRA | Decomposes weight into magnitude+direction | LoRA refinement, often better |
Adapters: The Original PEFT
Adapters (Houlsby et al., 2019) were the original parameter-efficient method: small bottleneck layers (down-project, nonlinearity, up-project) inserted inside each Transformer block, with the rest of the model frozen. They work well but, unlike LoRA, add extra layers that increase inference latency and cannot be merged away. LoRA's mergeability — zero inference cost — is a major reason it became dominant.
Prompt and Prefix Tuning: Adapting the Input
Prompt tuning and prefix tuning take a different approach: instead of modifying the weights, they prepend trainable 'soft' vectors to the input (prompt tuning) or to the attention keys and values (prefix tuning). The model weights are entirely frozen; only these few prepended vectors are learned. They are extremely parameter-efficient but generally less powerful than LoRA for difficult adaptations.
Let us consolidate everything into a practical recipe you could follow to fine-tune a base model into an instruction-following assistant. This section is the hands-on synthesis of the chapter.
The Recipe
Pipeline Flow: A complete SFT workflow
| 1 | Choose base | Pick a pretrained base model of the right size for your compute |
| 2 | Curate data | Assemble a few thousand high-quality, diverse instruction-response pairs |
| 3 | Format | Apply the model's chat template; mask non-assistant tokens |
| 4 | Pick PEFT | LoRA (rank 8–16) on attention projections; QLoRA if memory-bound |
| 5 | Train | 1–3 epochs, lr 1e-5 to 2e-5 for full fine-tuning or 1e-4 to 3e-4 for LoRA, bf16, grad-clip 1.0 |
| 6 | Evaluate | Generate from held-out prompts; check helpfulness and format |
| 7 | Iterate | Fix data issues, adjust rank/lr, repeat — data fixes beat hyperparameter tweaks |
Hyperparameters That Matter
| Hyperparameter | Typical value | Guidance |
|---|---|---|
| Learning rate (full FT) | 1e-5 to 2e-5 | Small — avoid erasing pretrained knowledge |
| Learning rate (LoRA) | 1e-4 to 3e-4 | Higher — only adapters train, base is safe |
| Epochs | 1 to 3 | More overfits; watch validation loss |
| LoRA rank r | 8 to 16 | Higher for harder adaptations |
| LoRA alpha | r to 2r | Common convention; scales the update |
| Batch size | as large as fits | Use gradient accumulation if needed |
| Warmup | ~3% of steps | Brief warmup stabilizes the start |
| Max sequence length | 2k to 8k | Cover your longest conversations |
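The table translates naturally into a small configuration object. The sketch below is one possible way to organize it; the class name, field names, and the example dataset size are assumptions for illustration, with default values taken from the LoRA rows of the table.

```python
import math
from dataclasses import dataclass

@dataclass
class SFTHyperparams:
    """Hypothetical container for the table's LoRA-style defaults."""
    learning_rate: float = 2e-4   # LoRA range: 1e-4 to 3e-4
    epochs: int = 3               # 1 to 3
    lora_rank: int = 16           # 8 to 16
    lora_alpha: int = 32          # r to 2r
    micro_batch_size: int = 4     # what fits in memory
    grad_accum_steps: int = 8     # raises the effective batch size
    warmup_ratio: float = 0.03    # ~3% of steps
    max_seq_len: int = 4096       # 2k to 8k

def schedule(cfg, n_examples):
    """Return (total optimizer steps, warmup steps) for a dataset size."""
    effective_batch = cfg.micro_batch_size * cfg.grad_accum_steps
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * cfg.epochs
    warmup_steps = max(1, round(total_steps * cfg.warmup_ratio))
    return total_steps, warmup_steps

cfg = SFTHyperparams()
total_steps, warmup_steps = schedule(cfg, n_examples=5000)
```

With 5,000 examples and an effective batch of 32, this gives 471 optimizer steps and 14 warmup steps. For full fine-tuning you would lower the learning rate into the 1e-5 to 2e-5 range and ignore the LoRA fields.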
After fine-tuning, how do you know it worked? And when results disappoint, how do you diagnose the cause? SFT has characteristic success signals and failure modes, and recognizing them turns debugging from guesswork into method.
Signs It Worked
Common Failure Modes and Their Causes
| Symptom | Likely cause | Fix |
|---|---|---|
| Still completes, doesn't answer | Too little SFT / lr too low | More epochs or higher lr |
| Repeats or never stops | End token missing from the data or masked out of the loss | Check template & EOS in data |
| Robotic, overfit responses | Trained too long / too little data | Fewer epochs, more diverse data |
| Lost knowledge/capability | Catastrophic forgetting | Lower lr, use LoRA, less data |
| Adopts a bad quirk | The quirk is in the training data | Audit and clean the data |
| Ignores system prompt | System turns not in training format | Include system turns in SFT data |
| Garbled output | Wrong chat template at inference | Use apply_chat_template |
How to Evaluate
Evaluation of an instruction-tuned model is harder than perplexity (Chapter 21), because we care about open-ended helpfulness, not just prediction. The main approaches: hold out some instruction-response pairs and check the model's responses qualitatively; use an automated 'LLM-as-judge' where a strong model rates response quality; and run instruction-following benchmarks (like IFEval or AlpacaEval) that score how well the model obeys instructions. Reading actual generated responses remains the single most informative check.
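Some instruction-following checks can be automated with simple rules, in the spirit of benchmarks that score verifiable instructions. The sketch below is a hypothetical mini-harness, not the implementation of IFEval or AlpacaEval; `generate` stands in for your model's generation function and is stubbed here with canned responses.

```python
def max_words(limit):
    """Check: the response uses at most `limit` words."""
    return lambda response: len(response.split()) <= limit

def contains(keyword):
    """Check: the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()

def no_leaked_tokens(markers=("<|im_start|>", "<|im_end|>")):
    """Check: no chat-template tokens leaked into the visible text."""
    return lambda response: not any(m in response for m in markers)

def pass_rate(cases, generate):
    """Fraction of cases whose response passes ALL of its checks."""
    passed = 0
    for case in cases:
        response = generate(case["prompt"])
        if all(check(response) for check in case["checks"]):
            passed += 1
    return passed / len(cases)

cases = [
    {"prompt": "Describe Paris in at most 10 words.",
     "checks": [max_words(10), contains("Paris"), no_leaked_tokens()]},
    {"prompt": "Name the capital of France and use the word 'capital'.",
     "checks": [contains("capital"), no_leaked_tokens()]},
]

# Stub model: canned outputs standing in for real generations.
canned = {
    cases[0]["prompt"]: "Paris is a lively city on the Seine.",
    cases[1]["prompt"]: "It is Paris.<|im_end|>",
}
rate = pass_rate(cases, canned.get)
```

The second canned response fails twice: it omits the required word and leaks an end-of-turn token, the kind of template problem listed in the failure-mode table. Rule-based checks like these complement, rather than replace, reading the generations yourself.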
SFT Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Base vs assistant | Base completes; SFT teaches it to answer | SFT elicits, doesn't teach knowledge |
| Instruction tuning | Train on (instruction, response) pairs | Diversity drives generalization |
| SFT objective | Masked next-token cross-entropy | Mask the prompt; loss on response |
| Chat templates | Exact special-token format per model | Use apply_chat_template |
| Data quality | Few great examples beat many bad | Superficial alignment hypothesis |
| LoRA | Low-rank trainable update, base frozen | 2dr params; mergeable; α/r scaling |
| QLoRA | LoRA + 4-bit frozen base | Big models on one GPU |
| Hyperparameters | Low lr (full) / high lr (LoRA), 1–3 epochs | Don't overfit small data |
Exercises
Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.
Further reading: “Training language models to follow instructions with human feedback” (Ouyang et al., 2022, InstructGPT) for the original SFT+RLHF recipe. “LIMA: Less Is More for Alignment” (Zhou et al., 2023) for the superficial alignment hypothesis. “Finetuned Language Models Are Zero-Shot Learners” (Wei et al., 2021, FLAN). “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021) and “QLoRA: Efficient Finetuning of Quantized LLMs” (Dettmers et al., 2023). “Parameter-Efficient Transfer Learning for NLP” (Houlsby et al., 2019, adapters). The Hugging Face PEFT and TRL library documentation for hands-on tooling.
Next → Chapter 23: Reinforcement Learning from Human Feedback
SFT taught the model to follow instructions by imitating good demonstrations — but imitation has limits. The model can only be as good as its demonstrations, and for many qualities (helpfulness, harmlessness, nuanced judgment) it is far easier for humans to COMPARE two responses than to WRITE the ideal one. Chapter 23 introduces Reinforcement Learning from Human Feedback (RLHF): instead of imitating demonstrations, we train a reward model on human PREFERENCES between responses, then use reinforcement learning to optimize the model against that reward. This is how SFT models become the polished, aligned assistants you actually interact with — and the preference pairs you saw in this chapter become the central training signal.