Part V: Alignment & RLHF
Chapter 22

Supervised Fine-Tuning

Instruction tuning, chat templates, and catastrophic forgetting
22 Exercises
22.1

At the end of Part IV you had a fully pretrained base model. It is a marvel: trained on trillions of tokens, it can complete text with uncanny fluency, has absorbed a vast amount of world knowledge, and predicts the next token better than any system before it. And yet, if you tried to USE it as a chatbot, you would be disappointed. This chapter explains why — and how to fix it.

The reason is simple once you see it. A base model was trained on one objective and one objective only: predict the next token in web text. It is a text-completion engine. It has never been taught to answer questions, follow instructions, or hold a conversation. It only knows how to continue whatever text it is given, in the style of its training data.

What a Base Model Actually Does

Suppose you give a base model the prompt 'What is the capital of France?'. You might expect 'Paris'. But the base model does not 'answer' — it CONTINUES. On the web, a line like that is often followed by more quiz questions, so the base model might continue with another question instead of answering:

Preference Pair
Prompt: What is the capital of France?
Chosen: Paris. (what an instruction-tuned model does — it answers)
Rejected: What is the capital of Germany? What is the capital of Spain? ... (what a base model often does — it continues the pattern of a quiz)

Neither response is 'wrong' as text completion — the base model is doing exactly what it was trained to do. The problem is a MISMATCH between what it was trained to do (continue text) and what we want it to do (helpfully answer). Supervised fine-tuning is how we close that gap: we take the capable base model and teach it the NEW behaviour of being a helpful assistant.

The Two-Stage Recipe

Modern assistant models are built in two big stages. First, PRETRAINING (Parts III–IV) gives the model its raw capabilities and knowledge — this is the expensive part, costing millions of dollars. Then, POST-TRAINING (this Part V) shapes that raw capability into helpful, harmless, honest behaviour — this is comparatively cheap, but it is what makes the model usable.

Pipeline Flow: The path from raw text to helpful assistant

1. Pretraining: predict the next token on trillions of web tokens → a capable but raw base model (Parts III–IV)
2. SFT: fine-tune on instruction-response demonstrations → a model that follows instructions (this chapter)
3. Preference tuning: RLHF or DPO aligns the model with human preferences (Chapters 23–24)
4. Safety tuning: constitutional methods make it harmless and honest (Chapter 26)
SFT Is the First and Most Important Post-Training Step
Supervised fine-tuning is where a base model first learns to behave like an assistant. It teaches the model the basic 'shape' of helpful interaction: when given an instruction, produce a helpful response and then stop. Everything after SFT — RLHF, DPO, safety — refines this behaviour, but SFT establishes it.
Remarkably, SFT often requires only a few thousand to a few tens of thousands of examples, a tiny fraction of the pretraining data. This is because SFT mostly does not teach the model new knowledge or skills; it ELICITS capabilities the model already has and channels them into a helpful format. We will return to this profound idea in Section 22.5.
22.2

Instruction tuning — the most common form of SFT — is conceptually the simplest idea in this entire book. We show the model many examples of instructions paired with good responses, and we train it to produce those responses. That is it. The model learns, by imitation, to respond helpfully to instructions it has never seen.

The Data: Instruction-Response Pairs

The training data for instruction tuning is a collection of (instruction, response) pairs. Each pair is a demonstration of the behaviour we want. Here are a few examples of what such pairs look like:

Instruction (input) | Response (target)
Summarize this in one sentence: [long text] | A concise one-sentence summary of the text.
Translate 'hello' into French. | Bonjour.
Write a haiku about autumn. | Leaves drift to the ground / crimson and gold in the breeze / autumn whispers low
Explain photosynthesis to a 5-year-old. | Plants eat sunlight! They use light, water, and air to make their own food...
Fix the bug in this code: [code] | The bug is on line 3. Here is the corrected version: [fixed code]

Notice the diversity: summarization, translation, creative writing, explanation, coding. The model is not trained on one task — it is trained on MANY tasks, all framed as 'here is an instruction, here is a good response'. This diversity is the key. By seeing thousands of different instructions, the model learns the GENERAL skill of 'follow the instruction helpfully', which then transfers to brand-new instructions at test time.

Intuition: Why Imitation Works: Eliciting, Not Teaching
Here is the crucial intuition for beginners. The base model already KNOWS how to translate, summarize, and write haiku — it learned these abilities from the vast pretraining corpus, which contained translations, summaries, and poems. What it does not know is that, when given an instruction, it should USE these abilities to respond directly.
Instruction tuning does not teach new skills; it teaches the model to RECOGNIZE an instruction and RESPOND in the helpful assistant format. It is like a brilliant but rambling expert who, after a little coaching on how to answer questions directly, becomes a great teacher. The expertise was always there; the coaching just channels it.

Where Does Instruction Data Come From?

Early instruction-tuning datasets were built by converting existing NLP datasets into instruction format (the FLAN and T0 approach). Modern datasets use a mix of sources: human-written demonstrations (expensive but high-quality), examples distilled from a stronger model (cheap and scalable), and curated, filtered collections. We cover data curation in detail in Section 22.5; for now, the key point is that the data is a set of demonstrations of good behaviour.

22.3

Here is good news for the beginner: the SFT training objective is the SAME next-token cross-entropy loss you already know from Chapter 15. We are not learning a new kind of training. We take the instruction-response pairs, format them into sequences, and train the model to predict each next token — exactly as in pretraining. The model architecture does not change at all.

There is just ONE important twist, and understanding it is essential: we usually do NOT want the model to be penalized for failing to predict the INSTRUCTION tokens — only the RESPONSE tokens. We mask out the loss on the prompt. Let us build up to why.

Formatting a Training Example

First, we concatenate the instruction and response into a single sequence, with special markers showing where each begins (more on these markers in Section 22.4). Schematically, one training example becomes:

Text: One SFT training sequence
[INSTRUCTION tokens]  [RESPONSE tokens]  [END]

Example (simplified):
  <user> What is 2+2? <assistant> 2+2 equals 4. <end>
  \_____ prompt _____/  \___ response ___/

Why Mask the Prompt?

During pretraining, the model predicts EVERY token. But in SFT, the instruction is GIVEN — it is the input, not something the model should learn to generate. If we trained the model to predict the instruction tokens too, we would waste capacity teaching it to generate instructions (which we never want it to do) instead of focusing on generating good responses. So we apply a LOSS MASK: we compute the loss only on the response tokens and ignore the prompt tokens.

SFT Note: Loss Masking, Concretely
For the sequence '<user> What is 2+2? <assistant> 2+2 equals 4. <end>', we set the target labels for the prompt tokens ('<user> What is 2+2? <assistant>') to a special 'ignore' value (-100 in PyTorch), so they contribute zero to the loss. Only the response tokens ('2+2 equals 4. <end>') produce a training signal.
Beginners often forget this step and train on the full sequence. The model still works, but it wastes effort learning to predict instructions, which slightly hurts quality. Masking the prompt focuses all the learning on what matters: producing good responses.
Python: Preparing an SFT example with loss masking
import torch

def prepare_sft_example(prompt, response, tokenizer):
    """Tokenize a prompt+response pair and build masked labels."""
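    # Assumes a Hugging Face-style tokenizer. The response is encoded without
    # special tokens so that no extra BOS lands in the middle of the sequence.
    prompt_ids   = tokenizer.encode(prompt)                              # the instruction
    response_ids = tokenizer.encode(response, add_special_tokens=False)  # the target answer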

    # The model sees the full sequence: prompt followed by response
    input_ids = prompt_ids + response_ids

    # Labels: -100 means 'ignore in the loss'. We mask the prompt so
    # the model is only trained to predict the RESPONSE tokens.
    labels = ([-100] * len(prompt_ids)) + response_ids

    return torch.tensor(input_ids), torch.tensor(labels)

# Example:
#   prompt   = '<user> What is 2+2? <assistant>'   -> labels all -100
#   response = ' 2+2 equals 4. <end>'              -> labels = the token ids
# PyTorch's cross_entropy(ignore_index=-100) skips the -100 positions,
# so the loss comes only from the response. That's the whole twist.

The Loss

Text: SFT loss (masked next-token cross-entropy)
L = -(1/|R|) Σ_{t ∈ R} log P(x_t | x_{<t})

R = the set of RESPONSE token positions only
# Identical to pretraining's loss, but averaged over response tokens, not all tokens.
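To make the masked average concrete, here is a minimal sketch in plain Python. The probabilities are made-up numbers standing in for what a model might assign to each correct token, and `masked_nll` is a hypothetical helper name, not a library function.

```python
import math

IGNORE = -100  # the same sentinel the chapter uses for masked label positions

def masked_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not IGNORE.

    token_probs[t] is the probability assigned to the correct token at
    position t (illustrative numbers here, not real model output).
    """
    kept = [p for p, y in zip(token_probs, labels) if y != IGNORE]
    if not kept:
        raise ValueError("no response tokens to train on")
    return -sum(math.log(p) for p in kept) / len(kept)

# The first two positions belong to the prompt, the last two to the response.
probs  = [0.01, 0.02, 0.5, 0.25]
labels = [IGNORE, IGNORE, 7, 9]

loss = masked_nll(probs, labels)
print(round(loss, 4))
```

The poorly predicted prompt positions (0.01 and 0.02) have no effect on the result: only the two response positions enter the average.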
SFT Is Pretraining, Focused
Step back and appreciate the simplicity: SFT is just pretraining (next-token prediction) applied to a small, carefully-chosen dataset of demonstrations, with the loss focused on the responses. No new architecture, no new loss function, no reinforcement learning (that comes in Chapter 23). If you understood Chapter 15, you already understand 90% of SFT.
This simplicity is why SFT is the natural first post-training step. The hard parts of SFT are not the training — they are the DATA (what to train on) and the EFFICIENCY (how to train cheaply). The rest of this chapter is about those two things.
22.4

In Section 22.3 we used informal markers like '<user>' and '<assistant>'. Real models use precise, fixed formats called CHAT TEMPLATES, built from special tokens. Getting these exactly right is one of the most common stumbling blocks for beginners, so we will go slowly and carefully here.

Why Templates Exist

A conversation has structure: there are turns, and each turn has a ROLE — system (instructions about how to behave), user (the human), or assistant (the model). The model needs to know where each turn begins and ends, and who is speaking. Special tokens mark these boundaries. The model is TRAINED with these exact tokens, so at inference time you must use the SAME tokens, or the model will be confused.

System
Instructions about the assistant's behaviour and persona, usually given once at the start.
User
A message from the human — a question, instruction, or statement.
Assistant
The model's response. This is the part the model generates and is trained on.

A Concrete Template: ChatML

One widely-used format is ChatML, which wraps each turn in '<|im_start|>' and '<|im_end|>' tokens with the role name. Here is a full conversation in ChatML format:

Python: A conversation in ChatML format
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

# Each turn: <|im_start|>{role}\n{content}<|im_end|>
# The model learns: after '<|im_start|>assistant\n', generate a response,
# then emit '<|im_end|>' to signal it is finished.
# Different model families use DIFFERENT templates -- LLaMA, Mistral, and
# Gemma each have their own. There is no universal standard.
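Purely to make the format concrete, the sketch below renders a message list into the ChatML layout shown above. `render_chatml` is a hypothetical helper written for illustration; details such as the trailing newline after each turn are assumptions, and real models should always be driven through their tokenizer's own template rather than a hand-written function like this one.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts as a ChatML-style string."""
    text = ""
    for m in messages:
        # Each turn: <|im_start|>{role}\n{content}<|im_end|>
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        text += "<|im_start|>assistant\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

Seeing the string built by hand also shows how easy it is to get a newline or token wrong, which is the reason to rely on the tokenizer in practice.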

Different Models, Different Templates

There is no single standard chat template — each model family invented its own. LLaMA-2 uses '[INST]' and '[/INST]' markers; LLaMA-3 uses a different header-based format; Mistral, Gemma, and others each differ. The content is the same conversation; only the wrapping tokens differ. This is why you must use the template that MATCHES the model you are fine-tuning or running.

⚠️
Pitfall: The #1 Beginner Mistake: Wrong or Hand-Built Templates
By far the most common SFT and inference bug is using the wrong chat template, or hand-constructing it with a small error (a missing newline, the wrong token, an extra space). The model was trained on an EXACT format; even a tiny deviation can degrade responses dramatically, because the model never saw that format during training.
The fix is simple: NEVER hand-build chat templates. Use the tokenizer's built-in apply_chat_template method, which knows the exact format for that model. This single habit prevents a whole class of frustrating, hard-to-diagnose quality problems.
Python: Always use apply_chat_template
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')

# Note: this model's template accepts only alternating user/assistant turns
# and raises an error on a 'system' message. Templates that define a system
# role (ChatML, for example) accept one as the first message.
messages = [
    {'role': 'user',      'content': 'Hello!'},
    {'role': 'assistant', 'content': 'Hi! How can I help?'},
    {'role': 'user',      'content': 'Tell me a joke.'},
]

# For TRAINING: tokenize a conversation that ends with an assistant turn
input_ids = tok.apply_chat_template(messages[:2], tokenize=True)

# For INFERENCE: add the generation prompt so the model continues as the
# assistant. What gets appended depends on the template: ChatML adds
# '<|im_start|>assistant\n', while some templates add nothing at all.
prompt = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True)

# The tokenizer handles the exact special tokens, spacing, and newlines.
# You never need to remember the format -- the tokenizer knows it.
SFT Note: Multi-Turn Masking
In a multi-turn conversation, the SFT loss is computed only on the ASSISTANT turns — every user and system turn is masked out (labels = -100), just like the prompt in the single-turn case. The model learns to produce assistant responses given the preceding conversation, but is never trained to generate user messages.
Some training setups train on all assistant turns in a conversation at once (efficient); others train on only the final turn. Both are used; training on all assistant turns makes fuller use of each conversation.
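The multi-turn rule can be sketched in a few lines of plain Python. This is a simplified illustration: the whitespace "tokenizer" is a toy stand-in, role markers are omitted, and `build_labels` is a hypothetical helper name.

```python
IGNORE = -100

def build_labels(turns, encode):
    """turns: list of (role, text) pairs; encode: text -> list of token ids."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = encode(text)
        input_ids += ids
        # Only assistant tokens keep their ids as labels; the rest are masked.
        labels += ids if role == "assistant" else [IGNORE] * len(ids)
    return input_ids, labels

# Toy tokenizer: one id per whitespace-separated word (not a real tokenizer).
vocab = {}
def toy_encode(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]

turns = [
    ("system", "Be brief."),
    ("user", "Hi"),
    ("assistant", "Hello there"),
    ("user", "Joke?"),
    ("assistant", "No."),
]
input_ids, labels = build_labels(turns, toy_encode)
print(input_ids)
print(labels)
```

Both assistant turns contribute labels in a single pass, which corresponds to the "train on all assistant turns at once" setup described above.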
22.5

If SFT is just training on demonstrations, then the demonstrations are everything. The single most important factor in SFT success is DATA QUALITY. And the most surprising, important finding in this area is that you need far LESS data than you might expect — but it must be GOOD.

The Superficial Alignment Hypothesis

The LIMA paper (Zhou et al., 2023) made a striking claim, backed by experiment: a model fine-tuned on just 1,000 carefully-curated, high-quality examples could rival models trained on tens of thousands. They called the explanation the 'Superficial Alignment Hypothesis': a model's knowledge and capabilities are learned almost entirely during pretraining; SFT mostly teaches the model which of its existing abilities and FORMATS to use when responding.

What This Means for You
If SFT is teaching FORMAT and STYLE rather than knowledge, then quality and diversity matter far more than quantity. 1,000 excellent, diverse demonstrations teach the helpful-response format better than 100,000 noisy, repetitive ones. The mediocre examples actively HURT, by teaching the model to imitate mediocre responses.
This is liberating for beginners: you do not need a massive dataset to get a good instruction-following model. You need a modest set of genuinely good examples. Curating 1,000 great examples is a tractable task; scraping 1,000,000 mediocre ones is both harder and worse.

What Makes SFT Data Good?

Quality dimension | What to aim for
Correctness | Responses are accurate, complete, and genuinely helpful
Diversity | Many different tasks, topics, formats, and difficulty levels
Format consistency | Responses follow a consistent, clean style and structure
Appropriate length | Detailed enough to help, not padded with fluff
Helpful tone | Polite, direct, well-organized: the behaviour you want
No contradictions | The dataset does not teach conflicting behaviours

Sources of SFT Data

There are three main ways to obtain instruction data, each with trade-offs:

Human-written: experts or trained annotators write demonstrations. Highest quality and most controllable, but slow and expensive. This is how the first InstructGPT data was made.
Model-distilled: prompt a strong existing model to generate instruction-response pairs (e.g. Self-Instruct, Alpaca). Cheap and scalable, but inherits the teacher model's quality and biases — and may have licensing restrictions.
Curated/filtered collections: assemble and heavily filter existing open datasets, keeping only the best examples. The LIMA approach. Balances cost and quality.
Preference Pair
Prompt: Explain what a variable is in programming.
Chosen: A variable is a named container that stores a value your program can use and change later. Think of it like a labeled box: you put something in (assign a value), refer to it by its label (the name), and can swap the contents anytime. For example, in age = 25, the variable age holds the value 25.
Rejected: A variable is a thing that holds data. variables are important in programming. you use them a lot.

The two responses above answer the same instruction, but the first (chosen) one is the kind of demonstration you want in your SFT set: accurate, complete, well-structured, with a concrete example. The second (rejected) one is vague, carelessly capitalized, and unhelpful. Training on responses like the second one teaches the model to be vague and unhelpful. This is why curation, actively keeping the good and discarding the bad, is the heart of SFT data work.

⚠️
Garbage In, Garbage Out — Amplified
Because SFT teaches the model to IMITATE its training responses, any flaw in your data becomes a flaw in your model. If your demonstrations are verbose, the model becomes verbose. If they hedge excessively, the model hedges. If they contain a recurring formatting quirk, the model adopts it. The model faithfully learns whatever behaviour you demonstrate — good or bad.
This makes SFT data curation a high-leverage, high-responsibility task. Read your data. Spot-check responses. Remove the bad ones. The few hours spent cleaning a dataset pay off more than almost any hyperparameter tuning.
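A first mechanical pass over a dataset might look like the sketch below. The two heuristics (drop repeated instructions, drop very short responses) and the threshold of 8 words are illustrative assumptions, not recommendations from the chapter, and `basic_filter` is a hypothetical helper. A filter like this only removes the most obvious problems; it does not replace reading the data.

```python
def basic_filter(pairs, min_response_words=8):
    """Keep (instruction, response) pairs that pass two crude checks."""
    seen = set()
    kept = []
    for instruction, response in pairs:
        # Normalize case and whitespace so near-identical instructions collide.
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue  # repeated instruction
        if len(response.split()) < min_response_words:
            continue  # response too short to be a useful demonstration
        seen.add(key)
        kept.append((instruction, response))
    return kept

pairs = [
    ("Explain what a variable is.",
     "A variable is a named container that stores a value your program can change later."),
    ("explain what a  variable is.",
     "A variable is a labeled box that holds a value you can read and update."),
    ("What is a loop?", "A thing that repeats."),
]
clean = basic_filter(pairs)
print(len(clean))
```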
22.6

We now have all the pieces: the data (instruction-response pairs), the format (chat template), and the objective (masked next-token loss). Let us assemble them into a complete, working SFT training loop. We will build it step by step so every line is clear.

Step 1: Prepare the Dataset

Python: Building an SFT dataset with masking
import torch
from torch.utils.data import Dataset

class SFTDataset(Dataset):
    def __init__(self, conversations, tokenizer, max_len=2048):
        self.examples = []
        for conv in conversations:
            # Tokenize the full conversation with the chat template
            ids = tokenizer.apply_chat_template(conv, tokenize=True)
            labels = self._mask_non_assistant(conv, ids, tokenizer)
            self.examples.append((ids[:max_len], labels[:max_len]))

    def _mask_non_assistant(self, conv, ids, tok):
        """Set labels to -100 everywhere except assistant responses."""
        labels = [-100] * len(ids)
        for i, msg in enumerate(conv):
            if msg['role'] != 'assistant':
                continue
            # Assumes the template is prefix-stable: tokenizing the first i
            # turns (plus the assistant opener) gives a prefix of the full
            # token sequence. Verify this for your template.
            start = len(tok.apply_chat_template(conv[:i], tokenize=True,
                                                add_generation_prompt=True))
            end = len(tok.apply_chat_template(conv[:i + 1], tokenize=True))
            labels[start:end] = ids[start:end]   # unmask this assistant span
        return labels

    def __len__(self): return len(self.examples)
    def __getitem__(self, i): return self.examples[i]
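The training loop in Step 2 passes a `pad_collate` function to the DataLoader without defining it. One plausible version is sketched below, assuming each dataset item is a pair of Python lists as produced by `SFTDataset`. The default `pad_id=0` is a placeholder assumption; in practice you would use the tokenizer's pad token id.

```python
import torch

def pad_collate(batch, pad_id=0, ignore_index=-100):
    """Right-pad (input_ids, labels) pairs to the longest sequence in the batch."""
    max_len = max(len(ids) for ids, _ in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    labels = torch.full((len(batch), max_len), ignore_index, dtype=torch.long)
    for row, (ids, labs) in enumerate(batch):
        input_ids[row, :len(ids)] = torch.tensor(ids, dtype=torch.long)
        # Padded label positions stay at ignore_index, so they add no loss.
        labels[row, :len(labs)] = torch.tensor(labs, dtype=torch.long)
    return input_ids, labels

batch = [([1, 2, 3], [-100, 2, 3]), ([4], [4])]
input_ids, labels = pad_collate(batch)
print(input_ids.tolist())
print(labels.tolist())
```

Because padding is on the right and attention is causal, pad tokens cannot influence earlier positions, and their labels are ignored. A fuller setup would typically also return an attention mask.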

Step 2: The Training Loop

The training loop is almost identical to the pretraining loop of Chapter 15. The only differences are the masked labels and the much smaller dataset and learning rate. Notice how familiar this looks — SFT really is just focused fine-tuning.

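Python: Code Lab: the full SFT training loop
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed to exist already: `conversations` (a list of message lists) and
# `pad_collate` (pads a batch of (input_ids, labels) pairs to equal length,
# using -100 as the padding value for labels).

# 1. Load the PRETRAINED base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained('base-model').cuda()
tok   = AutoTokenizer.from_pretrained('base-model')
model.train()   # from_pretrained returns the model in eval mode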

# 2. Build the SFT dataset and loader
dataset = SFTDataset(conversations, tok)
loader  = torch.utils.data.DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=pad_collate)

# 3. SFT uses a SMALL learning rate -- we are gently nudging, not
#    retraining. Too high and we destroy pretrained knowledge.
opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)

# 4. Train for just 1-3 epochs -- more causes overfitting
for epoch in range(3):
    for input_ids, labels in loader:
        input_ids, labels = input_ids.cuda(), labels.cuda()
        with torch.autocast('cuda', dtype=torch.bfloat16):
            logits = model(input_ids).logits
            # Shift for next-token prediction, then masked cross-entropy
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100,           # <- the masking happens here
            )
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step(); opt.zero_grad()

# That's it. The model is now an instruction-follower.
SFT Note: SFT Hyperparameters Differ From Pretraining
Two things change from pretraining. First, the LEARNING RATE is much smaller (1e-5 to 2e-5, versus ~3e-4 for pretraining): we are making small adjustments to a capable model, not training from scratch, and a large rate would erase the pretrained knowledge. Second, we train for very few EPOCHS (1–3): SFT datasets are small, and training too long causes the model to overfit and lose generality.
These conservative settings reflect SFT's nature: it is a gentle nudge, not a heavy retraining. The pretrained model already knows almost everything; we are just teaching it the helpful-response format.
22.7

The training loop in Section 22.6 does FULL fine-tuning: it updates every one of the model's billions of parameters. This works, but it has serious practical problems that motivate the parameter-efficient methods in the rest of the chapter.

Problem 1: Memory Cost

Recall the memory budget from Chapter 18: full fine-tuning with AdamW needs about 16 bytes per parameter (the model, gradients, and two optimizer moments, plus the fp32 master copy). For a 7-billion-parameter model, that is about 112GB — more than a single 80GB GPU can hold. To fully fine-tune even a modest model, you need multiple expensive GPUs and the distributed-training machinery of Chapter 18.

Text: Full fine-tuning memory (per the Ch.18 rule)
Memory ≈ 16 bytes × N parameters  (+ activations)

7B model:  16 × 7e9   = 112 GB   (exceeds one 80GB GPU)
70B model: 16 × 70e9  = 1,120 GB  (needs many GPUs)

Most people cannot afford this just to fine-tune.
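The 16-byte rule is easy to turn into a small calculator. The per-item breakdown below is one common mixed-precision accounting that sums to 16 bytes; it is an assumption consistent with the rule above rather than the only possible split, and activations are left out.

```python
# Assumed breakdown of the 16 bytes per parameter (activations excluded).
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def full_finetune_gb(n_params):
    """Approximate memory in GB for full fine-tuning with AdamW."""
    return sum(BYTES_PER_PARAM.values()) * n_params / 1e9

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(name, full_finetune_gb(n), "GB")
```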

Problem 2: Storage Cost

Full fine-tuning produces a complete copy of the model for each task. If you fine-tune a 7B model for ten different tasks, you have ten 14GB checkpoints — 140GB of storage, and you must load a whole new model to switch tasks. This does not scale when you want many specialized variants.

Problem 3: Catastrophic Forgetting

When you update all parameters on a narrow SFT dataset, the model can 'forget' some of its pretrained knowledge — a phenomenon called catastrophic forgetting. The aggressive updates that teach the new behaviour can overwrite capabilities the model had before. Conservative learning rates help, but the risk grows with the amount of fine-tuning.

Catastrophic forgetting
The tendency of a neural network to lose previously-learned capabilities when trained on new data, because the new gradients overwrite the weights that encoded the old knowledge.
Intuition: The Core Insight Behind Parameter-Efficient Fine-Tuning
Here is the key realization. Full fine-tuning changes ALL the weights, but the CHANGE needed to teach a new behaviour is usually small and structured — the fine-tuned weights are close to the original ones. What if, instead of storing a whole new model, we stored only the small CHANGE? And what if that change has a simple, low-dimensional structure we can represent compactly?
This is exactly the idea behind Parameter-Efficient Fine-Tuning (PEFT). We freeze the original model and learn only a small number of NEW parameters that capture the needed change. The next sections build up the most important PEFT method, LoRA, from this insight.
22.8

LoRA (Low-Rank Adaptation; Hu et al., 2021) is the most important parameter-efficient fine-tuning method. It is elegant, effective, and widely used. We met it briefly in Chapter 20; here we develop it carefully and from the ground up, because understanding it well is essential for modern fine-tuning.

The Key Observation

Full fine-tuning learns an update ΔW to each weight matrix W, giving a new weight W + ΔW. The update ΔW has the same shape as W — for a 4096×4096 matrix, that is ~16.8 million numbers per matrix. LoRA's insight: this update ΔW tends to have LOW RANK. That means it can be FACTORED into the product of two much smaller matrices, capturing almost the same change with far fewer numbers.

Text: The LoRA factorization
Full fine-tuning:  W_new = W + ΔW     (ΔW is d×d, large)

LoRA:              ΔW = B A            (low-rank factorization)
    B is d×r,  A is r×d,  with rank r ≪ d

    W_new = W + B A
    Trainable params: 2·d·r  instead of  d²

Let us make the savings concrete. For a 4096×4096 weight matrix (d = 4096) with LoRA rank r = 8: full fine-tuning trains 4096×4096 ≈ 16.8 million parameters, while LoRA trains 2×4096×8 = 65,536 parameters — a 256× reduction. And because the original W is FROZEN, we only need to store and optimize those 65k LoRA parameters per matrix, not the whole model.
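The same arithmetic, written as a small helper so you can try other shapes and ranks. `lora_param_counts` is a hypothetical name; it also handles non-square matrices, where B is d_out×r and A is r×d_in.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora, reduction) parameter counts for one weight matrix."""
    full = d_in * d_out          # parameters in the dense update
    lora = r * (d_in + d_out)    # parameters in A (r x d_in) plus B (d_out x r)
    return full, lora, full / lora

full, lora, reduction = lora_param_counts(4096, 4096, r=8)
print(full, lora, reduction)
```

Doubling the rank doubles the LoRA parameter count and halves the reduction factor, which is the quality/cost trade-off the rank controls.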

How LoRA Works in the Forward Pass

During the forward pass, instead of computing y = Wx, LoRA computes y = Wx + B(Ax). The frozen W does its usual work, and the small low-rank path B(Ax) adds the learned adjustment. Crucially, only A and B receive gradients; W never changes. Here is the structure:

Arch Stack: LoRA: a frozen weight with a small trainable side-path

output: y = Wx + BAx, shape (d,)
+ : add the two paths
B (d×r, trainable): up-project r→d
A (r×d, trainable): down-project d→r
W (d×d, FROZEN): the pretrained weight
input: x, shape (d,)

The Two Hyperparameters: Rank r and Alpha

LoRA has two main knobs that beginners must understand:

Rank r: the inner dimension of the factorization — how much 'capacity' the update has. Small r (4–8) is cheaper and often enough; larger r (16–64) gives more capacity for harder adaptations. It is the main quality/cost dial.
Alpha (α): a scaling factor applied to the LoRA path. The update is scaled by α/r, so the effective contribution of the low-rank path is controlled independently of r. A common convention is to set α = 2r or α = r.
Text: LoRA with scaling
y = W x  +  (α / r) · B A x

α/r is the scaling factor:
    larger α  → the LoRA update has more influence
    dividing by r keeps the scale stable as you change rank

Initialization: Starting From the Original Model

A subtle but important detail: LoRA initializes B to ZERO and A to small random values. This means that at the start of training, BA = 0, so the model behaves EXACTLY like the original pretrained model. Training then gradually grows the update from zero. This is why LoRA fine-tuning is stable: it begins as the unchanged base model and departs from it smoothly.

Python: LoRA layer from scratch
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base_linear, r=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False            # FREEZE the original weight

        d_in  = base_linear.in_features
        d_out = base_linear.out_features
        self.r = r
        self.scaling = alpha / r                  # the alpha/r scaling

        # A: small random (down-projection),  B: ZERO (up-projection)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))    # B=0 -> starts as base model

    def forward(self, x):
        # frozen base path + scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Only A and B have requires_grad=True, so the optimizer updates
# just 2*d*r parameters per layer. The base model is untouched.
# At init B=0, so output = base(x) exactly -- training departs smoothly.

Merging: Zero Inference Cost

A wonderful property of LoRA: after training, you can MERGE the update back into the original weights by computing W + (α/r)BA once, producing a normal weight matrix. The merged model has the EXACT same architecture and inference speed as the original — the LoRA path adds zero inference cost. You can also keep the adapter separate and swap different LoRA adapters in and out to switch tasks instantly, since each adapter is just a few megabytes.
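A quick numerical check of the merging claim, using small random matrices in place of real weights. The sizes and values are arbitrary assumptions; B is given nonzero values to mimic a trained adapter.

```python
import torch

torch.manual_seed(0)
d, r, alpha = 16, 4, 8
W = torch.randn(d, d)           # frozen pretrained weight (stand-in)
A = torch.randn(r, d) * 0.01    # down-projection
B = torch.randn(d, r)           # up-projection, nonzero as after training
x = torch.randn(3, d)           # a batch of 3 inputs
scaling = alpha / r

# Adapter kept separate: base path plus scaled low-rank path.
y_adapter = x @ W.T + scaling * (x @ A.T @ B.T)

# Adapter merged: one ordinary weight matrix, one matmul.
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged, atol=1e-4))
```

The two outputs agree up to floating-point rounding, so the merged model needs no extra computation at inference time.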

Pref Note: LoRA Adapters Are Tiny and Swappable
A LoRA adapter for a 7B model is typically a few megabytes to a few tens of megabytes, depending on the rank and on which layers are adapted, versus 14GB for a full fine-tuned copy. You can store hundreds of task-specific adapters cheaply and load the right one on demand, all sharing a single frozen base model in memory. This makes serving many specialized variants practical.
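A rough size estimate shows where such numbers come from. The model shape below (32 layers, four 4096×4096 attention projections per layer) is an assumed LLaMA-style 7B layout, and `adapter_megabytes` is a hypothetical helper; adapting fewer matrices or using a smaller rank shrinks the result further.

```python
def adapter_megabytes(n_layers, shapes, r, bytes_per_param=2):
    """shapes: list of (d_in, d_out) for each adapted matrix in one layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return n_layers * per_layer * bytes_per_param / 1e6

# Assumed layout: 32 layers, q/k/v/o projections of size 4096 x 4096, bf16 storage.
size_mb = adapter_megabytes(32, [(4096, 4096)] * 4, r=8)
full_copy_mb = 14_000  # the 14GB full checkpoint mentioned above

print(round(size_mb, 1), "MB per adapter")
print(round(full_copy_mb / size_mb), "adapters fit in the space of one full copy")
```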
This is why LoRA transformed fine-tuning from an industrial activity into something a hobbyist can do on a single GPU, and why platforms host thousands of community-made LoRA adapters for popular base models.
22.9

LoRA dramatically reduces the TRAINABLE parameters, but you still need to hold the full frozen base model in memory for the forward and backward passes. For a 70B model in bf16, that frozen model alone is ~140GB — still too large for a single GPU. QLoRA (Dettmers et al., 2023) solves this with a clever combination: store the frozen base model in 4-BIT precision, while training the LoRA adapters in higher precision.

The QLoRA Idea

Quantization (covered fully in Chapter 27) stores numbers with fewer bits. QLoRA quantizes the frozen base model to just 4 bits per parameter — a 4× reduction from bf16's 16 bits. Since the base model is frozen and never updated, the precision loss from quantizing it is tolerable. The LoRA adapters, which ARE trained, stay in higher precision (bf16) so their gradients are accurate. The result: a 70B model fine-tunable on a single 48GB GPU.

Text: QLoRA memory: base model in 4-bit
Frozen base (4-bit):   0.5 bytes/param
    70B model: 0.5 × 70e9 = 35 GB   (fits on one 48GB GPU!)

LoRA adapters (bf16):  tiny (a few hundred MB)
Optimizer state:       only for the small adapters, not the base

vs full fine-tuning: ~1,120 GB. QLoRA: ~40 GB total.
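The base-model figures follow directly from bits per parameter. The helper below is a hypothetical sketch that counts only the frozen weights, ignoring quantization constants, adapters, and activations.

```python
def base_model_gb(n_params, bits_per_param):
    """Memory in GB to store the frozen weights alone."""
    return n_params * bits_per_param / 8 / 1e9

bf16_gb = base_model_gb(70e9, 16)   # frozen base in bf16
nf4_gb  = base_model_gb(70e9, 4)    # frozen base in 4-bit

print(bf16_gb, "GB in bf16")
print(nf4_gb, "GB in 4-bit")
```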

Three Innovations in QLoRA

Innovation | What it does
4-bit NormalFloat (NF4) | A 4-bit data type optimized for the bell-curve distribution of neural network weights, more accurate than naive 4-bit
Double quantization | Quantizes the quantization constants too, saving a little more memory
Paged optimizers | Offloads optimizer state to CPU memory during spikes, preventing out-of-memory crashes
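To see what "naive 4-bit" means, here is a sketch of plain uniform absmax quantization of one block of weights. This is deliberately NOT NF4: it spaces its 16 levels evenly instead of matching a bell curve. The function names and the example weights are made up for illustration.

```python
def quantize_block(values, n_levels=16):
    """Uniform absmax quantization of one block to integer codes 0..15."""
    scale = max(abs(v) for v in values) or 1.0   # per-block constant
    half = (n_levels - 1) / 2                    # 7.5: maps [-1, 1] onto [0, 15]
    codes = [round((v / scale + 1) * half) for v in values]
    return codes, scale

def dequantize_block(codes, scale, n_levels=16):
    """Map integer codes back to approximate float values."""
    half = (n_levels - 1) / 2
    return [(c / half - 1) * scale for c in codes]

weights = [0.02, -0.5, 0.13, 0.31, -0.07, 0.5, 0.0, -0.22]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

print(codes)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each block stores its 4-bit codes plus one floating-point `scale`. Those per-block constants are the values that double quantization compresses further.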
Python: QLoRA fine-tuning with Hugging Face PEFT
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 1. Configure 4-bit (NF4) quantization for the FROZEN base model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',            # NormalFloat-4
    bnb_4bit_compute_dtype=torch.bfloat16,       # compute in bf16
    bnb_4bit_use_double_quant=True,       # double quantization
)
model = AutoModelForCausalLM.from_pretrained('big-model', quantization_config=bnb)

# 2. Configure LoRA: which layers, what rank
lora = LoraConfig(
    r=16,                                # rank
    lora_alpha=32,                       # alpha (= 2r here)
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],  # attention projs
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
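# Illustrative output for a ~7B model; exact counts depend on architecture and rank:
# trainable params: 8.4M || all params: 7,000M || trainable%: 0.12%

# 3. Train EXACTLY like Section 22.6 -- the loop is unchanged.
# Only a fraction of a percent of the parameters are trained, on a 4-bit base, on one GPU.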
ML Connection: QLoRA Democratized Fine-Tuning
Before QLoRA, fine-tuning a model at the 65B scale required a multi-GPU cluster. After QLoRA, a single 48GB GPU could fine-tune a 65B model. This cut the cost of customization by orders of magnitude and unleashed a wave of community fine-tunes, specialized assistants, and research, many of which rely on QLoRA directly.
The lesson generalizes: efficiency techniques do not just save money — they change WHO can participate. By bringing large-model fine-tuning within reach of individuals, QLoRA broadened the field far beyond the well-resourced labs.
22.10

LoRA is the most popular PEFT method, but it is part of a broader family. Understanding the alternatives helps you appreciate why LoRA won and when another method might fit better. All share the same goal: adapt a frozen model by training only a small number of new parameters.

Method | How it adapts | Notes
LoRA | Low-rank update added to weights | Most popular; mergeable, no inference cost
QLoRA | LoRA + 4-bit frozen base | Fits big models on one GPU
Adapters | Small new layers inserted in blocks | Adds inference latency
Prefix tuning | Trainable 'virtual tokens' prepended | Prepends to keys/values
Prompt tuning | Trainable soft-prompt embeddings | Simplest; weakest for hard tasks
IA3 | Learned scaling vectors for activations | Very few parameters
DoRA | Decomposes weight into magnitude + direction | LoRA refinement, often better

Adapters: The Original PEFT

Adapters (Houlsby et al., 2019) were the original parameter-efficient method: small bottleneck layers (down-project, nonlinearity, up-project) inserted inside each Transformer block, with the rest of the model frozen. They work well but, unlike LoRA, add extra layers that increase inference latency and cannot be merged away. LoRA's mergeability — zero inference cost — is a major reason it became dominant.
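
The mergeability difference comes down to linearity, which a tiny numeric example can show. The sketch below uses hand-picked 2×2 matrices and plain Python lists; all values and helper names are illustrative, and the LoRA α/r scaling factor is omitted for brevity.

```python
def matvec(M, x):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(A, B):
    """Matrix-matrix product on nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(M, N):
    """Elementwise matrix sum."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(M, N)]


def relu(v):
    return [max(0.0, t) for t in v]


W = [[1.0, 2.0], [0.0, -1.0]]   # frozen base weight (2x2)
B = [[0.5], [1.0]]              # LoRA up-projection (2x1), rank 1
A = [[1.0, -1.0]]               # LoRA down-projection (1x2)
x = [3.0, 1.0]

# LoRA: W x + B (A x). Purely linear, so it folds into one matrix W + BA.
lora_out = [a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
merged_out = matvec(add(W, matmul(B, A)), x)


def adapter(h, down, up):
    """Bottleneck adapter: h + up(relu(down(h))). The ReLU between the two
    projections means no single matrix reproduces it for every input."""
    return [a + b for a, b in zip(h, matvec(up, relu(matvec(down, h))))]


h = matvec(W, x)
adapter_out = adapter(h, down=A, up=B)
print(lora_out, merged_out, adapter_out)
```

The LoRA path and the merged matrix give the same output, so the adapter matrices can be discarded after merging. The bottleneck adapter has to stay in the network as an extra layer, which is where its latency cost comes from.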

Prompt and Prefix Tuning: Adapting the Input

Prompt tuning and prefix tuning take a different approach: instead of modifying the weights, they prepend trainable 'soft' vectors to the input (prompt tuning) or to the attention keys and values (prefix tuning). The model weights are entirely frozen; only these few prepended vectors are learned. They are extremely parameter-efficient but generally less powerful than LoRA for difficult adaptations.
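
A minimal sketch of the prompt-tuning idea, using plain lists in place of tensors. The embedding width, the number of virtual tokens, and the function name are hypothetical, and the zero vectors merely stand in for the frozen model's embedding lookup.

```python
import random

random.seed(0)
d_model = 8      # hypothetical embedding width
n_virtual = 4    # hypothetical number of trainable soft-prompt vectors

# The only trainable parameters: n_virtual vectors of size d_model.
soft_prompt = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
               for _ in range(n_virtual)]


def prepend_soft_prompt(token_embeddings, soft_prompt):
    """Prompt tuning: concatenate the learned vectors in front of the
    (frozen) embeddings of the real input tokens."""
    return soft_prompt + token_embeddings


# Stand-in for the frozen embedding lookup of a 5-token input.
token_embeddings = [[0.0] * d_model for _ in range(5)]
model_input = prepend_soft_prompt(token_embeddings, soft_prompt)

trainable = n_virtual * d_model
print(len(model_input), "positions;", trainable, "trainable parameters")
```

The sequence grows by `n_virtual` positions and nothing inside the model changes. Prefix tuning follows the same pattern but prepends its learned vectors to the attention keys and values rather than to the input embeddings.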

Compare: Weight-based vs Input-based PEFT
Weight-based (LoRA, adapters, DoRA): modify or augment the model's weights. More expressive, handles harder adaptations, LoRA is mergeable. The mainstream choice.
Input-based (prompt/prefix tuning): prepend trainable vectors, leaving weights frozen. Fewer parameters still, fully reversible, but weaker for complex behaviour changes. Useful when you want many lightweight task switches on a fixed model.
SFT Note: Which Should You Use?
For almost all instruction-tuning and SFT work today, the answer is LoRA (or QLoRA if memory is tight, or DoRA for a small quality bump). It hits the best balance of expressiveness, efficiency, mergeability, and ecosystem support. Reach for the alternatives only for specific needs: prompt tuning for ultra-lightweight task switching, IA3 for the absolute minimum parameter count.
When in doubt: start with LoRA at rank 8–16 on the attention projections, and only explore alternatives if you hit a specific limitation. The Hugging Face PEFT library implements all of these behind a consistent interface, so experimenting is easy.
22.11

Let us consolidate everything into a practical recipe you could follow to fine-tune a base model into an instruction-following assistant. This section is the hands-on synthesis of the chapter.

The Recipe

Pipeline Flow: A complete SFT workflow

1. Choose base: Pick a pretrained base model of the right size for your compute
2. Curate data: Assemble a few thousand high-quality, diverse instruction-response pairs
3. Format: Apply the model's chat template; mask non-assistant tokens
4. Pick PEFT: LoRA (rank 8–16) on attention projections; QLoRA if memory-bound
5. Train: 1–3 epochs, lr ~1e-5 to 2e-5 for full fine-tuning or ~1e-4 to 3e-4 for LoRA, bf16, grad-clip 1.0
6. Evaluate: Generate from held-out prompts; check helpfulness and format
7. Iterate: Fix data issues, adjust rank/lr, repeat. Data fixes beat hyperparameter tweaks

Hyperparameters That Matter

Hyperparameter | Typical value | Guidance
Learning rate (full FT) | 1e-5 to 2e-5 | Small, to avoid erasing pretrained knowledge
Learning rate (LoRA) | 1e-4 to 3e-4 | Higher, since only adapters train and the base is safe
Epochs | 1 to 3 | More overfits; watch validation loss
LoRA rank r | 8 to 16 | Higher for harder adaptations
LoRA alpha | r to 2r | Common convention; scales the update
Batch size | as large as fits | Use gradient accumulation if needed
Warmup | ~3% of steps | Brief warmup stabilizes the start
Max sequence length | 2k to 8k | Cover your longest conversations
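
One way to keep these choices in one place is a small config object that also derives the dependent quantities (gradient-accumulation steps, total steps, warmup steps). The sketch below is plain Python and does not use any library's config class; `SFTRecipe`, `schedule`, and the default values are hypothetical, picked from inside the ranges in the table, and a single device is assumed.

```python
import math
from dataclasses import dataclass


@dataclass
class SFTRecipe:
    """Hypothetical container for the hyperparameters in the table above."""
    use_lora: bool = True
    epochs: int = 2
    lora_rank: int = 16
    warmup_fraction: float = 0.03
    max_seq_len: int = 4096
    grad_clip: float = 1.0

    @property
    def learning_rate(self):
        # Table above: 1e-4 to 3e-4 for LoRA, 1e-5 to 2e-5 for full fine-tuning.
        return 2e-4 if self.use_lora else 2e-5

    @property
    def lora_alpha(self):
        return 2 * self.lora_rank   # the "2r" end of the r-to-2r convention


def schedule(n_examples, per_device_batch, target_batch, recipe):
    """Derive gradient-accumulation steps, total optimizer steps, warmup steps."""
    accum = max(1, math.ceil(target_batch / per_device_batch))
    steps_per_epoch = math.ceil(n_examples / (per_device_batch * accum))
    total_steps = steps_per_epoch * recipe.epochs
    warmup_steps = max(1, round(total_steps * recipe.warmup_fraction))
    return accum, total_steps, warmup_steps


recipe = SFTRecipe()
accum, total, warmup = schedule(n_examples=5000, per_device_batch=4,
                                target_batch=64, recipe=recipe)
print(recipe.learning_rate, recipe.lora_alpha, accum, total, warmup)
```

With 5,000 examples, a per-device batch of 4 and a target batch of 64, this gives 16 accumulation steps and 158 optimizer steps over two epochs, of which about 5 are warmup. SFT runs are short, so a few percent of warmup amounts to only a handful of steps.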
⚠️ LoRA Uses a Higher Learning Rate Than Full Fine-Tuning
A common beginner confusion: LoRA typically uses a learning rate roughly 10× HIGHER than full fine-tuning (e.g. 2e-4 vs 2e-5). This is not a contradiction. With full fine-tuning, a high rate would damage the pretrained weights. With LoRA, the base weights are FROZEN and safe; only the small, freshly initialized adapters train, and they need a higher rate to move quickly away from their zero-update start (B = 0, so the adapter initially contributes nothing).
So the rule 'use a small learning rate for fine-tuning' applies to FULL fine-tuning of pretrained weights. LoRA adapters are new parameters and follow different, higher-rate dynamics. Mixing these up is a frequent source of poor results.
22.12

After fine-tuning, how do you know it worked? And when results disappoint, how do you diagnose the cause? SFT has characteristic success signals and failure modes, and recognizing them turns debugging from guesswork into method.

Signs It Worked

The model answers instructions directly instead of continuing them (the base-model behaviour from Section 22.1 is gone).
Responses follow a consistent, clean format matching your training data.
The model stops appropriately (emits the end token) instead of rambling on.
It generalizes to instruction TYPES not seen in training, not just memorized examples.

Common Failure Modes and Their Causes

Symptom | Likely cause | Fix
Still completes, doesn't answer | Too little SFT / lr too low | More epochs or higher lr
Repeats or never stops | End token missing from the data or masked out of the loss | Check template & EOS in data
Robotic, overfit responses | Trained too long / too little data | Fewer epochs, more diverse data
Lost knowledge/capability | Catastrophic forgetting | Lower lr, use LoRA, less data
Adopts a bad quirk | The quirk is in the training data | Audit and clean the data
Ignores system prompt | System turns not in training format | Include system turns in SFT data
Garbled output | Wrong chat template at inference | Use apply_chat_template
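
Several of these failures can be caught before training by inspecting tokenized examples. The following is a small diagnostic sketch, not part of any library: `check_sft_example` and the token ids are hypothetical, and it assumes the usual convention that a label of -100 is excluded from the loss.

```python
IGNORE_INDEX = -100   # label value excluded from the loss (assumed convention)


def check_sft_example(input_ids, labels, eos_id):
    """Return a list of human-readable problems found in one tokenized example."""
    problems = []
    if len(input_ids) != len(labels):
        problems.append("input_ids and labels differ in length")
        return problems
    # Tokens that actually contribute to the loss.
    supervised = [tok for tok, lab in zip(input_ids, labels) if lab != IGNORE_INDEX]
    if not supervised:
        problems.append("every token is masked: the example contributes no loss")
    if eos_id not in input_ids:
        problems.append("no end token in the sequence")
    elif eos_id not in supervised:
        problems.append("end token present but masked out of the loss")
    if labels and labels[0] != IGNORE_INDEX:
        problems.append("first token is supervised: prompt may be unmasked")
    return problems


EOS = 2   # hypothetical end-token id
good = check_sft_example([5, 6, 7, 8, EOS], [-100, -100, 7, 8, EOS], EOS)
bad = check_sft_example([5, 6, 7, 8, EOS], [-100, -100, 7, 8, -100], EOS)
print(good)
print(bad)
```

The second example reproduces the "never stops" row: the end token is in the sequence but carries no loss, so the model is never trained to emit it. One plausible way this happens is reusing the end token as the padding token and then masking every padding position.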

How to Evaluate

Evaluating an instruction-tuned model is harder than measuring perplexity (Chapter 21), because we care about open-ended helpfulness, not just next-token prediction. There are three main approaches: hold out some instruction-response pairs and inspect the model's responses qualitatively; use an automated 'LLM-as-judge', where a strong model rates response quality; and run benchmarks such as IFEval (which checks verifiable instruction constraints) or AlpacaEval (which uses a judge model to compare responses). Reading actual generated responses remains the single most informative check.
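
Constraints that code can verify are the easiest part of this to automate. The sketch below is in the spirit of verifiable-instruction benchmarks but is not the implementation of any of them; the helper names, prompts, and stand-in responses are all made up for illustration.

```python
def max_words(limit):
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def contains(keyword):
    """Constraint: the response mentions `keyword` (case-insensitive)."""
    return lambda response: keyword.lower() in response.lower()


def ends_with(suffix):
    """Constraint: the response ends with `suffix`, ignoring trailing whitespace."""
    return lambda response: response.rstrip().endswith(suffix)


def score(responses, checks):
    """Fraction of (response, check) pairs where the constraint holds."""
    passed = [check(resp) for resp, check in zip(responses, checks)]
    return sum(passed) / len(passed), passed


# Each held-out prompt carries a constraint that can be verified by code.
eval_set = [
    ("Name the capital of France in at most three words.", max_words(3)),
    ("Explain LoRA and mention the word 'rank'.", contains("rank")),
    ("Reply with one sentence that ends in a question mark.", ends_with("?")),
]

# Stand-in outputs; in practice these would be generated by the fine-tuned model.
responses = [
    "Paris.",
    "LoRA trains a small low-rank update while the base stays frozen.",
    "Here is a statement.",
]

accuracy, passed = score(responses, [check for _, check in eval_set])
print(f"{accuracy:.2f}", passed)
```

A score like this only measures whether constraints were obeyed, not whether the answer was helpful or correct, so it complements judge-based scoring and manual reading; it does not replace them.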

SFT Note: The Overfit-Quickly Trap
SFT datasets are small, so it is easy to overfit — train for too many epochs and the model memorizes the training responses, parroting them robotically and losing the ability to generalize to new instructions. The telltale sign is responses that are weirdly rigid or that quote training examples nearly verbatim.
The defense: hold out a validation set, watch its loss, and stop when it stops improving (usually after just 1–3 epochs). Resist the urge to train longer — with SFT, less is often more. The model already has the capabilities; you are just teaching it the format, which does not take long.
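
The stopping rule can be sketched as a small helper over per-epoch validation losses. The function name, the patience setting, and the loss values are hypothetical; a real run would evaluate on the held-out set after each epoch (or every few hundred steps) and save a checkpoint each time.

```python
def best_checkpoint(val_losses, patience=1, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once validation loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Hypothetical per-epoch losses: training loss keeps falling while
# validation loss turns upward after epoch 2, the overfitting signature.
train_losses = [1.42, 1.05, 0.71, 0.40, 0.19]
val_losses = [1.38, 1.21, 1.26, 1.41, 1.63]

best, stopped = best_checkpoint(val_losses, patience=1)
print(f"keep checkpoint from epoch {best}; stopped after epoch {stopped}")
```

Note that the training loss alone would suggest continuing to epoch 5. Only the validation curve reveals that everything after epoch 2 is memorization.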
22.13

SFT Quick-Reference

Concept | Key idea | Remember
Base vs assistant | Base completes; SFT teaches it to answer | SFT elicits, doesn't teach knowledge
Instruction tuning | Train on (instruction, response) pairs | Diversity drives generalization
SFT objective | Masked next-token cross-entropy | Mask the prompt; loss on response
Chat templates | Exact special-token format per model | Use apply_chat_template
Data quality | Few great examples beat many bad | Superficial alignment hypothesis
LoRA | Low-rank trainable update, base frozen | 2dr params; mergeable; α/r scaling
QLoRA | LoRA + 4-bit frozen base | Big models on one GPU
Hyperparameters | Low lr (full) / high lr (LoRA), 1–3 epochs | Don't overfit small data

Exercises

Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.

Exercise 1: Pen & Paper
Explain why a base model 'continues' rather than 'answers'. Give a prompt where the base and instruction-tuned behaviours would clearly differ.
Exercise 2: Pen & Paper
State the superficial alignment hypothesis. What does it imply about how much SFT data you need and why quality matters more than quantity?
Exercise 3: Pen & Paper
Why do we mask the prompt tokens in the SFT loss? What would go wrong if we trained on the full sequence including the instruction?
Exercise 4: Pen & Paper
Write the SFT loss formula and explain how it differs from the pretraining loss of Chapter 15. What stays the same?
Exercise 5: Pen & Paper
Explain why using the wrong chat template degrades a model's responses. Why should you never hand-build the template?
Exercise 6: Pen & Paper
List the three problems with full fine-tuning (memory, storage, forgetting) and explain how PEFT addresses each.
Exercise 7: Derive
For a d×d weight matrix and LoRA rank r, derive the trainable-parameter count 2dr. For d=4096, r=8, compute the reduction factor vs full fine-tuning.
Exercise 8: Pen & Paper
Explain why LoRA initializes B=0 and A small-random. What does the model compute at the very start of training, and why is this desirable?
Exercise 9: Pen & Paper
Explain the LoRA alpha/r scaling. Why divide by r? What happens to the update's influence if you double alpha?
Exercise 10: Pen & Paper
Why can LoRA use a much higher learning rate than full fine-tuning? Connect your answer to which parameters are frozen.
Exercise 11: Pen & Paper
Explain how QLoRA fits a 70B model on one GPU. Why is it acceptable to quantize the frozen base to 4-bit but keep the adapters in bf16?
Exercise 12: Code
Implement prepare_sft_example with prompt masking (labels=-100 on the prompt). Verify that only response tokens contribute to the loss.
Exercise 13: Code
Take a base model and a handful of instruction-response pairs. Show the base model's 'continuation' behaviour, then fine-tune and show it now answers.
Exercise 14: Code
Use a tokenizer's apply_chat_template on a multi-turn conversation. Print the exact tokens, then demonstrate how a hand-built (wrong) template differs.
Exercise 15: Code
Implement multi-turn loss masking: given a conversation, set labels to -100 for all system/user tokens and keep only assistant-response tokens.
Exercise 16: Code Lab
Build the complete SFT training loop from Section 22.6 and fine-tune a small base model on a curated instruction set. Show before/after responses on held-out prompts.
Exercise 17: Code
Implement the LoRALinear layer from scratch. Verify that at initialization its output equals the frozen base layer's output, and that only A and B receive gradients.
Exercise 18: Code
Wrap a small model's attention projections with your LoRA layer and fine-tune. Confirm with print_trainable_parameters that <1% of parameters train.
Exercise 19: Code Lab
Fine-tune the same model with (a) full fine-tuning and (b) LoRA. Compare peak memory, training speed, final quality, and checkpoint size.
Exercise 20: Code
Demonstrate LoRA merging: after training, merge BA into the base weights and verify the merged model produces identical outputs to the unmerged LoRA model.
Exercise 21: Code Lab
Set up QLoRA with a 4-bit quantized base using bitsandbytes and PEFT. Fine-tune a model that would not fit in bf16 on your GPU, and report the memory savings.
Exercise 22: Code (Challenge)
Build a complete SFT pipeline: curate ~500 diverse high-quality instruction-response pairs (write or filter them yourself), fine-tune a base model with LoRA, evaluate with held-out prompts and an LLM-as-judge, then deliberately add 200 low-quality examples and show how the model's quality degrades — demonstrating the superficial alignment hypothesis in practice.

Further reading: “Training language models to follow instructions with human feedback” (Ouyang et al., 2022, InstructGPT) for the original SFT+RLHF recipe. “LIMA: Less Is More for Alignment” (Zhou et al., 2023) for the superficial alignment hypothesis. “Finetuned Language Models Are Zero-Shot Learners” (Wei et al., 2021, FLAN). “LoRA: Low-Rank Adaptation of Large Language Models” (Hu et al., 2021) and “QLoRA: Efficient Finetuning of Quantized LLMs” (Dettmers et al., 2023). “Parameter-Efficient Transfer Learning for NLP” (Houlsby et al., 2019, adapters). The Hugging Face PEFT and TRL library documentation for hands-on tooling.


Next → Chapter 23: Reinforcement Learning from Human Feedback

SFT taught the model to follow instructions by imitating good demonstrations — but imitation has limits. The model can only be as good as its demonstrations, and for many qualities (helpfulness, harmlessness, nuanced judgment) it is far easier for humans to COMPARE two responses than to WRITE the ideal one. Chapter 23 introduces Reinforcement Learning from Human Feedback (RLHF): instead of imitating demonstrations, we train a reward model on human PREFERENCES between responses, then use reinforcement learning to optimize the model against that reward. This is how SFT models become the polished, aligned assistants you actually interact with — and the preference pairs you saw in this chapter become the central training signal.
