Parameter-Efficient Fine-Tuning
We have spent the whole book building and aligning a model. Now we have to RUN it — serve it to users, fast and cheaply. This turns out to be a completely different engineering challenge from training, with its own bottlenecks, its own metrics, and its own bag of tricks. Part VI begins here, with the techniques that make a trained model practical to deploy.
The Cost Shift: Training Is One-Time, Inference Is Forever
Here is the key economic fact. Training a model is a ONE-TIME cost — expensive, but paid once. Inference is a RECURRING cost — paid for every single token the model ever generates, for every user, forever. A popular model serves billions of tokens per day. Over a model's deployed life, the total inference cost dwarfs the training cost. This is why inference optimization matters so much: shaving 20% off inference cost saves far more, in absolute terms, than almost any training optimization.
Two Goals in Tension: Latency and Throughput
Inference optimization juggles two goals that often conflict. LATENCY is how fast a single user gets their response — it matters for interactive use (a chatbot must feel responsive). THROUGHPUT is how many tokens you can generate across all users per second — it determines cost-efficiency (more throughput means serving more users per GPU). Many techniques trade one for the other, and the right balance depends on the application.
| Latency (single-user speed) | Throughput (total tokens/sec) |
|---|---|
| How fast ONE response arrives | How many tokens across ALL users |
| Matters for interactivity | Matters for cost-efficiency |
| Small batches are better | Large batches are better |
| A chatbot feels snappy | Serve more users per GPU |
| Optimize per-request | Optimize aggregate |
To optimize inference, you must understand that generating a response happens in TWO distinct phases with very different performance characteristics. Confusing them is a common beginner mistake; distinguishing them is the foundation for everything else in this chapter.
Phase 1: Prefill (Processing the Prompt)
When you send a prompt, the model first PROCESSES the entire prompt in one go — this is prefill. All the prompt's tokens are fed through the model together, in parallel, to produce the first output token and to populate the KV cache (Section 27.3). Because all prompt tokens are processed at once, prefill is COMPUTE-BOUND: it does a lot of matrix multiplication and keeps the GPU's compute units busy.
Phase 2: Decode (Generating Tokens One by One)
After prefill, the model GENERATES the response one token at a time — this is decode. Each step takes the single most recent token, runs it through the model, and produces the next token, repeating until done. Because each step processes only ONE token, decode is MEMORY-BANDWIDTH-BOUND: the tiny computation for one token is dwarfed by the time spent reading the model weights and KV cache from memory.
Tool Trace: The two phases of generating a response
| User | Sends prompt: 'Explain photosynthesis' (5 tokens) | → |
| Prefill | Process all 5 prompt tokens at once → first output token + KV cache | • |
| Decode | Generate token 2, attending to the cached prompt | • |
| Decode | Generate token 3, attending to all previous | • |
| Decode | ... continue one token at a time until <end> | • |
| User | Receives the full streamed response | ← |
| Property | Prefill | Decode |
|---|---|---|
| Processes | All prompt tokens at once | One token at a time |
| Bottleneck | Compute (matmul) | Memory bandwidth |
| GPU utilization | High | Low (underutilized) |
| Parallelism | Across prompt tokens | None (sequential) |
| Determines | Time to first token | Time per output token |
| Optimized by | Chunking, fusion | Batching, smaller weights |
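The compute-bound versus memory-bound split can be made concrete with a rough roofline estimate. The sketch below is a simplification under stated assumptions: it counts only weight reads and matmul FLOPs (ignoring attention and KV-cache traffic), and the hardware figures are hypothetical round numbers, not any specific GPU.

```python
def arithmetic_intensity(n_tokens: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Each weight contributes about 2 FLOPs (multiply + add) per token,
    but is read from memory only once per pass, however many tokens share it.
    """
    return (2.0 * n_tokens) / bytes_per_weight


# Hypothetical accelerator (assumed values, for illustration only).
PEAK_FLOPS = 300e12        # 300 TFLOP/s of compute
PEAK_BANDWIDTH = 2e12      # 2 TB/s of memory bandwidth
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOPs/byte needed to keep compute busy


def bound(n_tokens: int) -> str:
    """Classify a pass over n_tokens as compute-bound or memory-bound."""
    if arithmetic_intensity(n_tokens) >= RIDGE:
        return "compute-bound"
    return "memory-bound"


print("prefill, 2048-token prompt:", bound(2048))
print("decode, 1 token:", bound(1))
```

Under these assumptions a single decode step performs about one FLOP per byte read, far below the ridge point, which is why batching (more tokens per weight read) and smaller weights (fewer bytes per read) are the levers for decode.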
We met the KV cache in Chapters 13 and 19. It is so central to inference that we revisit it here as the object around which all of inference optimization revolves. Understanding it deeply makes the rest of the chapter click into place.
Why the Cache Exists
Recall how attention works: each new token attends to ALL previous tokens, using their keys (K) and values (V). Without a cache, generating each new token would require recomputing the keys and values for every previous token, which is enormously wasteful since those don't change. The KV CACHE stores the keys and values of all previous tokens, so each new token only computes ITS OWN K and V and reads the rest from the cache. Over a T-token generation, this cuts the key/value computation from O(T²) projections to O(T); each step still attends over all cached tokens, but nothing is recomputed. This is what makes generation practical.
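To see the saving, we can simply count key/value projections with and without a cache. This is a counting sketch only (no real attention is computed), and the function name is hypothetical.

```python
def kv_projections(num_tokens: int, use_cache: bool) -> int:
    """Count K/V projections performed while generating num_tokens tokens."""
    total = 0
    for step in range(1, num_tokens + 1):
        if use_cache:
            total += 1      # project only the newest token; the rest is cached
        else:
            total += step   # re-project every token seen so far
    return total


T = 1000
print("without cache:", kv_projections(T, use_cache=False))
print("with cache:   ", kv_projections(T, use_cache=True))
```

Without the cache the count grows as T(T+1)/2; with it, the count is exactly T.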
The Problem: The Cache Is Huge
The KV cache solves a compute problem but creates a MEMORY problem. It grows with every token generated, and at long context lengths and large batch sizes it becomes the dominant consumer of GPU memory — often larger than the model weights themselves. Recall the formula from Chapter 19:
cache = 2 · layers · kv_heads · head_dim · seq_len · batch · bytes
# the 2 is for Keys AND Values
Example: 13B model (40 layers, 40 heads, 128 dim), 4k ctx, batch 16, fp16:
2 × 40 × 40 × 128 × 4096 × 16 × 2 ≈ 54 GB
That is about 54 GB of KV cache for one batch, roughly twice the ~26 GB the model's fp16 weights occupy and more than most single GPUs can hold alongside them. This is why so much of inference optimization targets the cache: shrinking it (GQA from Chapter 19, quantizing it), managing its memory efficiently (PagedAttention, Section 27.7), and reusing it across requests (prefix caching, Section 27.10). The KV cache is the resource that everything fights over.
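The cache formula is easy to turn into a small calculator. The sketch below plugs in the 13B example and then, as a purely hypothetical variant, the same model with 8 KV heads to show what GQA would buy.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV-cache size in bytes; the leading 2 covers keys AND values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


# 13B example from the text: 40 layers, 40 heads, head_dim 128, 4k context, batch 16, fp16.
full = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=4096, batch=16)
print(f"multi-head attention: {full / 1e9:.1f} GB")

# Hypothetical GQA variant with 8 KV heads instead of 40 (an assumption for illustration).
gqa = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, seq_len=4096, batch=16)
print(f"GQA with 8 KV heads:  {gqa / 1e9:.1f} GB")
```

The full-attention case comes to about 53.7 GB; cutting the KV heads by 5× cuts the cache by the same factor.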
You cannot optimize what you cannot measure. Inference has a specific vocabulary of metrics, and using the right ones is essential. Let us define them carefully, because they map directly onto the prefill/decode phases from Section 27.2.
| Metric | Meaning | Driven by |
|---|---|---|
| TTFT | Time To First Token — prompt sent to first token out | Prefill speed |
| TPOT | Time Per Output Token — gap between successive tokens | Decode speed |
| Latency | Total time for the full response | TTFT + TPOT × tokens |
| Throughput | Total output tokens/sec across all requests | Batching, efficiency |
| Goodput | Throughput meeting latency targets | Both, balanced |
How the Metrics Connect
These metrics fit together simply. The total time a user waits for a complete response is roughly the time to the first token (TTFT, set by prefill) plus the time per subsequent token (TPOT, set by decode) times the number of tokens generated. For an interactive chatbot, a low TTFT makes it feel responsive (the answer starts quickly), and a low TPOT makes it read smoothly (tokens stream fast enough to read).
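In practice these metrics are computed from timestamps recorded as tokens stream back. A minimal sketch, assuming we have the send time and the arrival time of each output token (the function and field names are hypothetical):

```python
def request_metrics(sent_at: float, token_times: list) -> dict:
    """Compute TTFT, TPOT and total latency from per-token arrival times (seconds)."""
    n = len(token_times)
    ttft = token_times[0] - sent_at
    # TPOT averages the gaps AFTER the first token, so it divides by n - 1.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    latency = token_times[-1] - sent_at
    return {"ttft": ttft, "tpot": tpot, "latency": latency}


# Five tokens: the first arrives after 0.5 s, the rest every 0.25 s.
m = request_metrics(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Note that with this definition the total is TTFT plus TPOT times the number of tokens after the first, which matches the approximate formula once the response is more than a few tokens long.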
latency ≈ TTFT + TPOT × (output tokens)
TTFT: how long until the FIRST token appears (prefill)
TPOT: time between each token after that (decode)
# A chatbot wants low TTFT (feels responsive) AND low TPOT (reads smoothly).
The single most impactful inference optimization is QUANTIZATION: representing the model's weights (and sometimes activations and the KV cache) with fewer bits. A model trained in 16-bit precision can often be run in 8-bit or even 4-bit with little quality loss, halving or quartering its memory footprint and, because decode is memory-bound (Section 27.2), speeding it up substantially.
Why Quantization Helps So Much
Recall that decode is bottlenecked by READING the weights from memory, not by computing with them. If you store the weights in 4 bits instead of 16, there is 4× LESS data to read per token — so decode gets up to 4× faster, AND the model takes 4× less memory (leaving more room for KV cache and bigger batches). Quantization attacks the exact bottleneck of inference.
| Precision | Bits/weight | Memory (7B model) | Quality |
|---|---|---|---|
| fp16 / bf16 | 16 | ~14 GB | Baseline (full) |
| int8 | 8 | ~7 GB | Near-lossless |
| int4 | 4 | ~3.5 GB | Small loss, usually fine |
| int3 / int2 | 3 / 2 | ~2.6 / 1.8 GB | Noticeable degradation |
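The table's memory column, and the decode speedup it implies, follow from simple arithmetic. The sketch below assumes decode is purely bandwidth-bound at batch size 1 and ignores the small overhead of per-group scales and KV-cache reads; the 1000 GB/s bandwidth figure is a hypothetical round number.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed to store the weights, in GB."""
    return n_params * bits / 8 / 1e9


def max_decode_tokens_per_sec(n_params: float, bits: int, bandwidth_gb_s: float) -> float:
    """Upper bound for batch-1 decode: every weight is read once per token."""
    return bandwidth_gb_s / weight_memory_gb(n_params, bits)


N_PARAMS = 7e9
BANDWIDTH_GB_S = 1000.0  # assumed memory bandwidth, for illustration only

for bits in (16, 8, 4):
    mem = weight_memory_gb(N_PARAMS, bits)
    rate = max_decode_tokens_per_sec(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit: {mem:5.2f} GB, at most {rate:6.1f} tokens/s")
```

These are ceilings, not predictions: real kernels pay for dequantization and rarely reach peak bandwidth, which is why the text says "up to" 4×.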
The Basic Idea: Map Floats to a Small Set of Integers
Quantization maps a range of floating-point values onto a small set of integers. Instead of storing each weight as a 16-bit float, you store a scale factor per group of weights and represent each weight as a small integer that, multiplied by the scale, approximates the original. The art is choosing the mapping so that the approximation loses as little quality as possible.
For a group of weights w with max absolute value M:
scale = M / (2^(bits-1) - 1)
q = round(w / scale) # small integer
w ≈ q × scale # dequantized approximation
# Store q (few bits) + scale (one per group). Reconstruct w when needed.
import torch

def quantize_int8(weights, group_size=128):
    """Per-group symmetric int8 quantization."""
    # Assumes weights.numel() is a multiple of group_size.
    # Work in fp32 so the scale arithmetic is not itself rounded to fp16.
    w = weights.float().reshape(-1, group_size)  # groups of weights
    # One scale per group, from the group's max magnitude
    # (clamped so an all-zero group does not divide by zero)
    scale = (w.abs().max(dim=1, keepdim=True).values / 127).clamp(min=1e-8)
    q = torch.round(w / scale).clamp(-128, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate fp weights for computation."""
    return q.float() * scale

# int8: store q (1 byte) + a scale per 128 weights, vs 2 bytes/weight.
# ~2x smaller, near-lossless. int4 (4 bits) is ~4x smaller with care.
# The trick is choosing scales/grouping to minimize error -- that's what
# the methods in the next section (GPTQ, AWQ) do cleverly.
Weight-Only vs Weight-and-Activation
Most LLM inference quantization is WEIGHT-ONLY: only the weights are stored in low precision, while computation happens in higher precision (the weights are dequantized on the fly). This is because weights are the memory bottleneck at decode time, and quantizing activations too is harder (activations have outliers that are sensitive to precision loss). Weight-only int4 is the sweet spot for most deployments.
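Why are activation outliers such a problem? With a symmetric scale set by the largest magnitude, one big value stretches the grid so far that all the small values land on only a few integers. The sketch below illustrates this with made-up numbers (the vectors are synthetic, not taken from a real model).

```python
def quantize_roundtrip(values, bits=8):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values, bits=8):
    """Average absolute error introduced by the quantize/dequantize round trip."""
    approx = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, approx)) / len(values)


smooth = [0.01 * i for i in range(-50, 51)]   # well-behaved values in [-0.5, 0.5]
with_outlier = smooth + [60.0]                # the same values plus one large outlier

err_smooth = mean_abs_error(smooth)
err_outlier = mean_abs_error(with_outlier)
print(f"no outlier:   mean abs error {err_smooth:.5f}")
print(f"with outlier: mean abs error {err_outlier:.5f}")
```

A single outlier inflates the error on every other value by well over an order of magnitude here, which is the intuition behind keeping activations in higher precision.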
The naive quantization of Section 27.5 works, but smarter methods lose far less quality at the same bit-width. Three names dominate practice — GPTQ, AWQ, and GGUF — and beginners are often confused about how they differ. Let us clear that up.
| Method | What it is | Best for |
|---|---|---|
| GPTQ | Post-training quantization minimizing layer-wise error | GPU inference, 4-bit |
| AWQ | Activation-aware: protects the most important weights | GPU inference, quality |
| GGUF | A file FORMAT for quantized models (llama.cpp) | CPU / local / Mac inference |
| bitsandbytes | On-the-fly int8/int4 (used in QLoRA) | Easy, training + inference |
GPTQ: Minimizing Quantization Error
GPTQ (Frantar et al., 2022) is a clever post-training quantization method. Instead of quantizing each weight independently, it quantizes them in a way that MINIMIZES the resulting error on a small calibration dataset, adjusting the remaining weights to compensate for the error introduced by each quantized weight. This careful, error-aware approach lets GPTQ quantize to 4 bits with minimal quality loss. It is one of the most widely used 4-bit methods for GPU inference.
AWQ: Protecting the Important Weights
AWQ (Activation-aware Weight Quantization; Lin et al., 2023) is based on a key observation: not all weights are equally important. A small fraction of weights — those that multiply large activations — matter disproportionately for quality. AWQ identifies these important weights (by looking at activation magnitudes) and protects them from quantization error by scaling, while aggressively quantizing the rest. This activation-awareness often gives slightly better quality than GPTQ at the same bit-width.
GGUF: A Format, Not an Algorithm
GGUF is frequently confused with GPTQ and AWQ, but it is a different KIND of thing: it is a FILE FORMAT (used by llama.cpp), not a quantization algorithm. GGUF files store quantized models in a portable format optimized for running on CPUs, laptops, and Apple Silicon — enabling LLMs to run locally on consumer hardware without a GPU. GGUF supports many quantization levels (Q4_K_M, Q5_K_M, Q8_0, etc.), each a different bit-width and scheme. When you download a model to run locally on a Mac or PC, it is usually GGUF.
| You want to... | Use |
|---|---|
| Run on a GPU, 4-bit, good quality | GPTQ or AWQ |
| Run locally on a Mac/PC/CPU | GGUF (via llama.cpp / Ollama) |
| Fine-tune with quantization (QLoRA) | bitsandbytes (NF4) |
| Maximum throughput on GPU | AWQ/GPTQ with a serving engine (vLLM) |
We saw in Section 27.3 that the KV cache is a huge consumer of memory. But there is a second, subtler problem: traditional serving WASTES much of the memory it allocates to the cache. PagedAttention (Kwon et al., 2023), the technique behind the vLLM serving engine, solves this and was a major leap in serving efficiency.
The Waste Problem
Traditionally, when a request arrives, the server allocates one big CONTIGUOUS block of memory for its KV cache, sized for the maximum possible length. But most responses are far shorter than the maximum, so most of that reserved block sits empty and unusable by other requests. This is INTERNAL FRAGMENTATION: memory reserved but wasted. Kwon et al. (2023) found that existing serving systems wasted 60–80% of KV-cache memory through this kind of fragmentation and over-reservation.
How PagedAttention Works
# KV cache stored in small fixed-size blocks (like OS memory pages)
on each new token:
    if the current block is full:
        allocate a NEW block on demand (from a shared pool)
    append the token's K,V to the current block
    update the block table: logical position -> physical block
# Attention reads K,V through the block table (gather from blocks)
# No request over-reserves; freed blocks return to the shared pool
The payoff is large. By allocating cache memory in small blocks on demand, PagedAttention nearly eliminates the wasted memory, letting the server fit FAR more concurrent requests in the same GPU memory. The vLLM team reported up to 24× higher throughput than Hugging Face Transformers, and the paper reports 2–4× over prior serving systems such as FasterTransformer and Orca, largely from this. More requests fit, so larger batches are possible, so throughput soars.
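The bookkeeping behind paging fits in a few lines. The sketch below is a toy allocator that tracks block tables only (no tensors, no attention kernel); the class and method names are hypothetical and are not vLLM's internal API.

```python
class PagedKVCache:
    """Toy block allocator: a shared pool plus one block table per request."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # shared pool of physical block ids
        self.block_tables = {}                      # request id -> list of physical blocks
        self.lengths = {}                           # request id -> tokens stored so far

    def append_token(self, request_id: str) -> tuple:
        """Reserve a slot for one token; return (physical block, slot in block)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:           # no block yet, or current one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())    # allocate on demand
        self.lengths[request_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("a")   # 40 tokens need 3 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens need 1 block
print("blocks used by a:", len(cache.block_tables["a"]))
print("free blocks:", len(cache.free_blocks))
```

The only waste is the unfilled tail of each request's last block, which is bounded by the block size rather than by the maximum sequence length.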
Batching — processing many requests together — is how we get throughput, because it lets us read the weights ONCE and use them for MANY requests (recall the decode bottleneck from Section 27.2). But naive batching wastes the GPU. Continuous batching, the other half of vLLM's magic, fixes this and is one of the most important throughput optimizations.
The Problem with Static Batching
With STATIC (naive) batching, you collect a batch of requests, run them all together until ALL of them finish, then start the next batch. The problem: requests finish at different times — one needs 10 tokens, another needs 500. With static batching, the finished short requests sit idle, wasting their GPU slots, while the batch waits for the longest request. The GPU is underused, and new requests wait for the whole batch to clear.
| Static batching | Continuous batching |
|---|---|
| Batch runs until ALL finish | Each request leaves when IT finishes |
| Finished requests waste slots | Freed slots immediately reused |
| New requests wait for whole batch | New requests join mid-flight |
| GPU often underutilized | GPU stays full |
| Simple | Higher throughput, more complex |
How Continuous Batching Works
Continuous batching (also called in-flight or dynamic batching) operates at the level of individual DECODE STEPS. After every single token-generation step, the scheduler checks: did any request finish? If so, remove it and immediately admit a waiting request into the freed slot. The batch is continuously reshuffled, so the GPU is always working on a full batch of ACTIVE requests — no idle slots waiting for stragglers.
Tool Trace: Continuous batching: the scheduler reshuffles every step
| Scheduler | Batch has slots for 4 requests; all 4 active | • |
| Request A | Finishes at step 12 → leaves the batch | ← |
| Scheduler | Slot freed → admit waiting Request E immediately | → |
| Request E | Begins prefill and joins the active batch | → |
| Scheduler | GPU stays full; no slot idles waiting for stragglers | • |
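A small simulation makes the difference measurable. The sketch below counts decode steps for the same workload under both policies, under simplifying assumptions: every step costs the same, prefill is free, and each request's output length is known in advance (a real scheduler only learns it when the request finishes).

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """After every decode step, finished requests leave and waiting ones join."""
    waiting = deque(lengths)
    active = []   # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:     # refill any freed slots
            active.append(waiting.popleft())
        steps += 1                                 # one decode step for the whole batch
        active = [r - 1 for r in active if r > 1]  # drop requests that just finished
    return steps


output_lengths = [10, 500, 20, 30, 400, 15, 25, 35]  # tokens per request (made-up workload)
print("static:    ", static_batching_steps(output_lengths, slots=4))
print("continuous:", continuous_batching_steps(output_lengths, slots=4))
```

With this workload the static scheduler needs 900 steps because each batch waits for its straggler, while the continuous scheduler finishes in 500, the length of the single longest request.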
Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) is a clever technique that speeds up generation with NO loss in output quality: the output distribution is provably identical to that of normal decoding, and under greedy decoding the tokens themselves are identical. It directly attacks the decode bottleneck (one token per expensive forward pass) by generating SEVERAL tokens per pass of the big model. It feels like magic the first time you see it.
The Core Idea: Draft, Then Verify
The insight: a small, fast 'draft' model can GUESS the next several tokens cheaply, and the big model can VERIFY all those guesses in a SINGLE forward pass (because verifying tokens in parallel is cheap — it is just one prefill-like pass). If the guesses are right, you got several tokens for the price of one big-model pass. If a guess is wrong, you discard it from that point and continue. Because the big model only ever ACCEPTS tokens it would have generated anyway, the output is identical to normal decoding.
Tool Trace: Speculative decoding: draft then verify
| Draft model | Cheaply guesses next 4 tokens: 'the cat sat on' | → |
| Big model | Verifies all 4 in ONE forward pass | • |
| Big model | Accepts 'the cat sat', rejects 'on' (would've said 'down') | • |
| Big model | Got 3 tokens + 1 correction = 4 tokens, 1 big pass | ← |
| Draft model | Drafts the next batch from 'down', repeat | → |
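The trace above can be reproduced with toy models. The sketch below covers the greedy case only, where "verify" means checking each drafted token against the big model's own top choice; the sampling case in the papers uses a rejection-sampling rule instead. Both "models" are hypothetical deterministic functions, and the loop over positions stands in for what a real model does in one parallel forward pass.

```python
def target_next(context):
    """Toy 'big model': a deterministic next token in 0..9 (hypothetical)."""
    return (sum(context) * 7 + len(context)) % 10


def draft_next(context):
    """Toy 'draft model': agrees with the target unless the target would say 3."""
    guess = target_next(context)
    return 0 if guess == 3 else guess


def plain_decode(prompt, n_tokens):
    """Baseline: one big-model pass per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of big-model passes)."""
    seq, big_passes = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        draft = []
        for _ in range(k):                       # 1. draft k tokens cheaply
            draft.append(draft_next(seq + draft))
        big_passes += 1                          # 2. one verification pass
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)        # 3. first mismatch: take the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(seq + accepted))  # all accepted: one bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_tokens], big_passes


tokens, passes = speculative_decode([1, 2], n_tokens=12)
print("same output as plain decoding:", tokens == plain_decode([1, 2], 12))
print("big-model passes:", passes, "for", len(tokens), "tokens")
```

Every token appended is one the target would have chosen itself, so the output matches plain decoding while the number of big-model passes drops.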
Why It Works and When It Helps
Speculative decoding helps because many tokens are EASY to predict — common words, completions of obvious phrases, code boilerplate. The small draft model gets these right most of the time, so the big model verifies several at once. The speedup depends on the ACCEPTANCE RATE: how often the draft's guesses are correct. A good draft model (well-matched to the big one) achieves high acceptance and 2–3× speedups.
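How much the acceptance rate matters can be estimated with a geometric series. The sketch below assumes each drafted token is accepted independently with the same probability α, which is the simplification used by Leviathan et al.; real acceptance varies from token to token.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per big-model pass with k drafted tokens.

    Counts accepted draft tokens plus the one token the big model supplies itself.
    """
    if alpha >= 1.0:
        return float(k + 1)   # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per big pass")
```

Even at zero acceptance the big model still produces one token per pass, so the method never yields fewer tokens per pass than plain decoding; the cost is the extra draft-model work.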
Draft k tokens, big model verifies in 1 pass.
If α = average acceptance rate (each drafted token assumed accepted independently), expected tokens produced per big-model pass ≈
(1 - α^(k+1)) / (1 - α)
# This counts the accepted draft tokens plus the one token the big model supplies itself.
# High acceptance (easy tokens) → many tokens per big-model pass → big speedup.
# Output is IDENTICAL to normal decoding; quality is never traded away.
# big model M, small draft model D, draft length k
repeat until done:
    1. D cheaply generates k candidate tokens (autoregressively)
    2. M verifies all k in ONE forward pass (parallel)
    3. accept the longest prefix M agrees with
    4. on the first disagreement, take M's token instead
    5. continue drafting from there
# net: several tokens per big-model pass, same output as plain decoding
Beyond the big techniques, a collection of smaller but valuable optimizations round out a modern inference stack. Each targets a specific inefficiency, and together they add up to substantial gains.
| Technique | What it does |
|---|---|
| Prefix caching | Reuse the KV cache for shared prompt prefixes across requests |
| Chunked prefill | Split long-prompt prefill into chunks, interleaved with decode |
| KV-cache quantization | Store the KV cache in int8/int4 to fit more / longer contexts |
| FlashAttention | IO-aware exact attention kernel (Chapter 20) |
| Tensor parallelism | Split the model across GPUs for models too big for one (Ch. 18) |
| CUDA graphs | Capture and replay the decode step to cut kernel-launch overhead |
| Multi-token prediction | Predict several tokens per step (Medusa-style) |
Prefix Caching: Don't Recompute Shared Prompts
Many requests share a common prefix — the same long system prompt, the same few-shot examples, the same document being asked about repeatedly. Prefix caching stores the KV cache for that shared prefix and REUSES it across all requests that share it, instead of recomputing the prefill every time. For applications with a large fixed system prompt, this can dramatically cut TTFT and cost. It builds directly on PagedAttention's block-sharing (Section 27.7).
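One common way to implement this is to key each KV block by a hash of its tokens together with everything before it, so two prompts share a block only if their entire prefixes match. The sketch below is a toy version of that idea with placeholder block contents; the names and the block size are assumptions, not any engine's actual API.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (small, for illustration)


def block_keys(tokens):
    """One key per FULL block; each key covers the block and all tokens before it."""
    keys, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        running.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        keys.append(running.hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> stand-in for a cached KV block

    def prefill(self, tokens):
        """Return how many prompt tokens were served from the cache."""
        reused, still_matching = 0, True
        for key in block_keys(tokens):
            if still_matching and key in self.blocks:
                reused += BLOCK_SIZE          # KV already computed: skip this block
            else:
                still_matching = False
                self.blocks[key] = "kv"       # compute and store (placeholder value)
        return reused


system_prompt = list(range(8))                # 8 shared tokens = 2 full blocks
cache = PrefixCache()
first = cache.prefill(system_prompt + [100, 101, 102, 103, 104])
second = cache.prefill(system_prompt + [200, 201, 202, 203])
print("tokens reused by first request: ", first)
print("tokens reused by second request:", second)
```

The first request pays for the full prefill; the second reuses the two blocks covering the shared system prompt and only computes its own suffix.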
Chunked Prefill: Smoothing Out Long Prompts
A very long prompt makes prefill take a long time, which delays the decode of other requests sharing the batch (and spikes TTFT). Chunked prefill splits a long prefill into smaller chunks and INTERLEAVES them with the decode steps of other requests. This keeps decode latency smooth for everyone, rather than letting one giant prompt monopolize the GPU. It balances the compute-heavy prefill against the latency-sensitive decode.
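The scheduling idea can be sketched as a per-step token budget: decode tokens for running requests are served first, and whatever budget remains goes to the next chunk of the long prompt. This is a simplified planner under assumed numbers, not the policy of any particular engine.

```python
def schedule_steps(prompt_len, decode_requests, token_budget):
    """Plan steps that mix one decode token per running request with prefill chunks."""
    if token_budget <= decode_requests:
        raise ValueError("token budget must leave room for prefill")
    steps, remaining = [], prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - decode_requests)
        steps.append({"decode_tokens": decode_requests, "prefill_tokens": chunk})
        remaining -= chunk
    return steps


# A 1000-token prompt arrives while 8 requests are decoding; budget is 256 tokens per step.
plan = schedule_steps(prompt_len=1000, decode_requests=8, token_budget=256)
for i, step in enumerate(plan, start=1):
    print(i, step)
```

The long prompt now takes several steps to prefill, so its own TTFT rises a little, but the eight decoding requests keep receiving a token every step instead of stalling behind one large prefill.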
One more inference topic every practitioner must understand: HOW the next token is chosen from the model's output distribution. The model outputs a probability for every possible next token; the DECODING STRATEGY decides which one to actually emit. This affects output quality, diversity, and even speed, and the knobs (temperature, top-k, top-p) are ones you will tune constantly.
| Strategy | How it picks | Effect |
|---|---|---|
| Greedy | Always the highest-probability token | Deterministic, can be dull/repetitive |
| Temperature | Scales the distribution before sampling | Higher = more random/creative |
| Top-k | Sample from the k most likely tokens | Caps how wild it can get |
| Top-p (nucleus) | Sample from the smallest set summing to p | Adaptive cutoff |
| Beam search | Keep top-b sequences, pick the best | Better for closed-ended tasks |
Temperature: The Creativity Dial
Temperature is the most important sampling knob. It scales the logits before the softmax: a temperature below 1 SHARPENS the distribution (more confident, more deterministic), while above 1 FLATTENS it (more random, more diverse). As the temperature approaches 0 the distribution collapses onto the top token, so implementations treat temperature 0 as greedy decoding. For factual tasks you want low temperature (precise, consistent); for creative writing you want higher (varied, surprising).
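A three-token example shows the sharpening and flattening directly. The logits below are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed stably by subtracting the max."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The ranking of the tokens never changes; only how much probability mass the top token takes from the others.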
P(token) = softmax(logits / T)
T < 1: sharper → more deterministic, picks likely tokens (factual)
T = 1: the model's natural distribution
T > 1: flatter → more random, more diverse (creative)
T → 0: equivalent to greedy (always the top token)
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    """Pick the next token with temperature + top-k + top-p (nucleus)."""
    # 0. Temperature 0 means greedy: avoid dividing by zero, take the argmax
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)
    # 1. Temperature: scale logits (lower = sharper)
    logits = logits / temperature
    # 2. Top-k: keep only the k highest-logit tokens
    if top_k:
        k = min(top_k, logits.size(-1))  # k cannot exceed the vocabulary size
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float('-inf'))
    # 3. Top-p (nucleus): keep the smallest set summing to >= p
    probs = F.softmax(logits, dim=-1)
    sorted_p, idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_p, dim=-1)
    mask = cumulative - sorted_p > top_p  # drop the tail beyond p
    sorted_p[mask] = 0.0
    sorted_p /= sorted_p.sum(dim=-1, keepdim=True)
    # 4. Sample from the filtered distribution
    choice = torch.multinomial(sorted_p, 1)
    return idx.gather(-1, choice)  # map back to vocabulary ids

# Low temperature + greedy-ish: factual, consistent (Q&A, code).
# Higher temperature + top-p: creative, varied (stories, brainstorming).
You will almost never implement these optimizations yourself. Instead, you use a SERVING ENGINE: software that bundles quantization, PagedAttention, continuous batching, speculative decoding, and the rest into a single high-performance system. Knowing the major engines and what each is for completes the practical picture.
| Engine | Strengths | Best for |
|---|---|---|
| vLLM | PagedAttention, continuous batching, high throughput | GPU serving at scale |
| TGI | Hugging Face's production server | Easy HF model deployment |
| TensorRT-LLM | NVIDIA's heavily-optimized engine | Max performance on NVIDIA |
| llama.cpp | CPU/Mac/local, GGUF, lightweight | Local & on-device inference |
| Ollama | Simple local model running (wraps llama.cpp) | Hobbyist / local dev |
| SGLang | Fast, with strong prefix caching | Complex prompting workloads |
The Request Lifecycle in a Serving Engine
Tying the chapter together, here is the journey of a request through a modern serving engine, touching the optimizations we covered:
Pipeline Flow: A request through an optimized serving engine
| 1 | Arrive | Request enters the queue; prefix cache checked for shared prompt |
| 2 | Schedule | Continuous-batching scheduler admits it into the active batch |
| 3 | Prefill | Prompt processed (chunked if long); KV cache allocated in pages |
| 4 | Decode | Tokens generated; speculative decoding drafts ahead; quantized weights |
| 5 | Stream | Tokens streamed back to the user as they are produced |
| 6 | Free | On completion, KV-cache pages return to the shared pool |
# pip install vllm
from vllm import LLM, SamplingParams

# Load a model (optionally a quantized one); the engine handles
# PagedAttention, continuous batching, etc. automatically.
# Note: quantization='awq' expects the checkpoint itself to be AWQ-quantized,
# so point `model` at an AWQ export (or drop the argument for unquantized weights).
llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct',
          quantization='awq',  # use an AWQ-quantized model
          gpu_memory_utilization=0.9)
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)

# Submit many prompts at once -- the engine batches them continuously
prompts = ['Explain photosynthesis.', 'Write a haiku.']  # ... add more prompts here
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)

# PagedAttention and continuous batching are active under the hood by default;
# features such as speculative decoding need explicit configuration.
# You write a few lines; the engine delivers production-grade throughput.
Inference Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Cost shift | Inference >> training over a model's life | Per-token cost paid forever |
| Prefill vs decode | Compute-bound vs memory-bound | Decode is the bottleneck |
| KV cache | Stores past keys/values | Dominates memory; the central object |
| Metrics | TTFT, TPOT, throughput | Match metric to use case |
| Quantization | Fewer bits per weight | int4 ~4x smaller, faster decode |
| GPTQ/AWQ/GGUF | Algorithms vs a file format | GPU algos vs local format |
| PagedAttention | Page the KV cache like OS memory | Eliminates fragmentation |
| Continuous batching | Reshuffle batch every step | Keeps the GPU full |
| Speculative decoding | Draft small, verify big | Faster, same output |
| Serving engines | Bundle all optimizations | Use vLLM / llama.cpp |
Exercises
Exercises 1–11 are pen-and-paper or derivations; 12–22 require code.
Further reading: “Efficient Memory Management for LLM Serving with PagedAttention” (Kwon et al., 2023, vLLM). “GPTQ” (Frantar et al., 2022) and “AWQ” (Lin et al., 2023) for quantization. “Fast Inference from Transformers via Speculative Decoding” (Leviathan et al., 2023) and “Accelerating LLM Decoding with Speculative Sampling” (Chen et al., 2023). “Medusa” (Cai et al., 2024) for multi-head speculation. “Orca” (Yu et al., 2022) for continuous batching. The vLLM, TensorRT-LLM, and llama.cpp documentation. “The Curious Case of Neural Text Degeneration” (Holtzman et al., 2019) for top-p sampling.
Next → Chapter 28: Tool Calling & Function Calling
You can now run a model fast and cheaply. But a model alone is limited to what is in its weights — it cannot look up today's weather, run a calculation reliably, query a database, or take actions in the world. Chapter 28 gives the model TOOLS: the ability to call functions, APIs, and external systems. We will see how a model is trained and prompted to decide WHEN to use a tool, format the call, and incorporate the result — the foundation of agents. The actor-to-actor message flow you saw in this chapter's diagrams becomes the agentic loop, where the model and its tools converse to accomplish tasks beyond what any single forward pass could.