Inference Optimization
Detailed solutions for the exercises in Chapter 27. Try solving them yourself before checking the answers.
Solution
Training happens once; inference runs for every user request, forever. A popular model serves billions of tokens daily, so cumulative inference compute quickly dwarfs the one-time training cost. Implication: optimization effort should focus heavily on inference (quantization, batching, caching), and model sizing should account for lifetime serving cost (Chapter 16's inference-aware optimum) — a 2× inference speedup can save more than the entire training bill.
Solution
Prefill processes the whole prompt in parallel — large matmuls with high arithmetic intensity, so it is compute-bound and determines time-to-first-token (TTFT). Decode generates one token at a time, each step reading all weights and the KV cache to produce a single token — little compute per byte moved, so it is memory-bandwidth-bound and determines time-per-output-token (TPOT). Prefill→TTFT, decode→TPOT.
Solution
Decode does tiny matmuls (one token) while loading huge weight matrices, so the compute units sit idle waiting on memory — low utilization. The two fixes: (1) make the weights SMALLER (quantization) so less must be read per token, easing the memory bottleneck; (2) BATCH many requests so the loaded weights are reused across many tokens, raising arithmetic intensity and filling the GPU. Both attack the memory-bound nature of decode.
Solution
Cache = 2 × layers × heads × head_dim × seq × batch × bytes = 2×32×32×128×8192×32×2.
With GQA using 8 KV heads instead of 32, the cache shrinks by 32/8 = 4×, to ≈34 GB — which is why GQA is essential to fit long-context, large-batch serving in memory.
Solution
TTFT = time to first token (prefill latency); TPOT = time per output token (decode speed); latency = total time for a response; throughput = tokens (or requests) served per second across all users.
A chatbot cares most about TTFT and TPOT (perceived responsiveness); an overnight batch job cares about throughput (total tokens/sec, latency irrelevant). The metric you optimize depends on the product.
Solution
Per-group int8 stores each weight as an 8-bit integer plus a shared fp16 scale (and zero-point) per group of weights, so the dominant cost is ~1 byte/weight. A 7B model: fp16 = 14 GB, int8 ≈ 7 GB, int4 ≈ 3.5 GB. Since decode is memory-bandwidth-bound, halving (int8) or quartering (int4) the bytes read per token gives roughly a proportional decode speedup (up to ~2× / ~4×), minus dequantization overhead.
Solution
GPTQ quantizes weights layer-by-layer, using second-order (Hessian) information to minimize the output error introduced, adjusting remaining weights to compensate. AWQ is activation-aware: it identifies the small fraction of weight channels that interact with large-magnitude activations (the salient ones) and protects them (scaling so they quantize with less error). Activation-aware quantization preserves quality because not all weights matter equally — the ones multiplied by large activations dominate the output, so protecting them prevents the worst errors, unlike uniform quantization that treats all weights the same.
Solution
GPTQ/AWQ are GPU-oriented quantization formats optimized for fast GPU inference (used by vLLM, TensorRT). GGUF (llama.cpp) is a CPU/Apple-Silicon-friendly format supporting flexible quantization for local/consumer hardware. For datacenter GPU serving, use GPTQ or AWQ (best GPU throughput); for running on a MacBook, use GGUF with llama.cpp (runs well on CPU/Metal, no datacenter GPU needed). Match the format to the deployment hardware.
Solution
Pre-allocating a contiguous KV cache for each request's MAX length wastes memory when requests are shorter (internal fragmentation) and prevents fitting more requests. PagedAttention borrows OS virtual memory: it splits the KV cache into fixed-size BLOCKS allocated on demand, with a per-request block table mapping logical positions to physical blocks. Like pages in RAM, blocks need not be contiguous and are allocated only as the sequence grows — eliminating fragmentation and letting many more requests share GPU memory.
Solution
Static batching groups a fixed set of requests and waits for ALL to finish before starting the next batch — so finished requests leave the GPU idle while waiting for the slowest. Continuous batching adds new requests and evicts finished ones at the token level, every step, so freed slots are immediately refilled, keeping the GPU full. It requires a scheduler that manages per-request state (KV cache, position) and dynamically composes each step's batch — the core of vLLM-style serving.
Solution
A small draft model proposes k tokens; the large model verifies them in one parallel pass and accepts each with a probability that, via the rejection-sampling correction, makes the final distribution EXACTLY the large model's — so output is identical in distribution to normal decoding. Expected accepted tokens before the first rejection out of k drafts is Σ_{i=1}^{k} α^i = α(1−α^k)/(1−α), plus one 'bonus' token always sampled from the target:
Higher acceptance α and longer drafts k yield more tokens per expensive target pass — the speedup, free of quality loss.
Solution
Caching past keys/values makes each new token cost O(T) instead of recomputing O(T²); the measured speedup grows with sequence length and is large for long generations — the single most important inference optimization (Chapter 12/13 in the serving context).
Solution
Quantizing weights to int8/int4 with per-group scales (Exercise 6) reduces memory ~2×/4×. Measuring the reconstruction error shows int8 is nearly lossless while int4 introduces small but visible error — quantifying the memory/quality trade-off explored in Exercise 14.
Solution
Running the model at fp16 and int4 on a benchmark typically shows a small quality drop (a few points) for int4 in exchange for ~4× memory savings and faster decode — the empirical case that 4-bit quantization is usually worth it for serving, especially with quality-preserving methods (AWQ/GPTQ).
Solution
Temperature flattens/sharpens the distribution; top-k restricts to the k most likely; top-p (nucleus) keeps the smallest set summing to p. Generating text under each shows higher temperature/k/p increasing diversity at some coherence cost — the decoding knobs of Chapter 3 in the serving loop.
Solution
Instrumenting the loop to time the first token (TTFT) separately from subsequent tokens (TPOT) shows TTFT growing with prompt length (more prefill compute) while TPOT stays roughly constant (per-token decode) — confirming the prefill/decode distinction of Exercise 2.
Solution
Modeling a stream of variable-length requests shows static batching leaving the GPU idle as short requests finish and wait for long ones, while continuous batching refills freed slots immediately — yielding markedly higher utilization and throughput (Exercise 10).
Solution
Storing the cache in fixed-size blocks allocated on demand with a block table (Exercise 9) and comparing to max-length contiguous allocation shows far less wasted memory (no over-allocation for short sequences), letting more requests fit — the memory efficiency behind vLLM.
Solution
Drafting k tokens with a small model and verifying with the large model gives a speedup determined by the acceptance rate (Exercise 11), while the rejection-sampling correction makes the output distribution identical to plain decoding — verified by matching samples at the same seed.
Solution
vLLM's paged KV cache and continuous batching give high throughput out of the box; switching to an AWQ-quantized model further reduces memory and can increase throughput (smaller weights, more room for batching) at a small quality cost — the production stack of Exercises 7–10 in one tool.
Solution
Detecting a shared prompt prefix (e.g. a long system prompt repeated across requests) and reusing its cached KV avoids re-prefilling it each time, sharply reducing TTFT for those requests — a large win when many requests share a common preamble.
Solution
Layering the optimizations and adding each in turn shows their cumulative effect: continuous batching and paging raise throughput and memory efficiency, int4 cuts memory and speeds decode, and the combination dramatically outperforms a naive one-request-at-a-time fp16 baseline in both throughput and latency — the integrated payoff of the chapter, with each optimization's contribution isolated.