Solutions Appendix
Chapter 27

Inference Optimization

22 Solutions

Detailed solutions for the exercises in Chapter 27. Try solving them yourself before checking the answers.

Exercise 1Pen & Paper
Why does inference cost dominate training over a deployed model's lifetime? Implication?

Solution

Training happens once; inference runs for every user request, forever. A popular model serves billions of tokens daily, so cumulative inference compute quickly dwarfs the one-time training cost. Implication: optimization effort should focus heavily on inference (quantization, batching, caching), and model sizing should account for lifetime serving cost (Chapter 16's inference-aware optimum) — a 2× inference speedup can save more than the entire training bill.

Exercise 2Pen & Paper
Distinguish prefill and decode; why is prefill compute-bound and decode memory-bound? TTFT vs TPOT?

Solution

Prefill processes the whole prompt in parallel — large matmuls with high arithmetic intensity, so it is compute-bound and determines time-to-first-token (TTFT). Decode generates one token at a time, each step reading all weights and the KV cache to produce a single token — little compute per byte moved, so it is memory-bandwidth-bound and determines time-per-output-token (TPOT). Prefill→TTFT, decode→TPOT.

Exercise 3Pen & Paper
Why does decode underutilize the GPU? Connect to the two fix families.

Solution

Decode does tiny matmuls (one token) while loading huge weight matrices, so the compute units sit idle waiting on memory — low utilization. The two fixes: (1) make the weights SMALLER (quantization) so less must be read per token, easing the memory bottleneck; (2) BATCH many requests so the loaded weights are reused across many tokens, raising arithmetic intensity and filling the GPU. Both attack the memory-bound nature of decode.

Exercise 4Derive
KV-cache size for 7B (32 layers, 32 heads, 128 dim), 8k context, batch 32, fp16; GQA-8 effect?

Solution

Cache = 2 × layers × heads × head_dim × seq × batch × bytes = 2×32×32×128×8192×32×2.

≈ 1.37×10¹¹ bytes ≈ 137 GB

With GQA using 8 KV heads instead of 32, the cache shrinks by 32/8 = 4×, to ≈34 GB — which is why GQA is essential to fit long-context, large-batch serving in memory.

Exercise 5Derive
Define TTFT, TPOT, latency, throughput; write the latency formula; which metrics matter for chatbot vs batch?

Solution

TTFT = time to first token (prefill latency); TPOT = time per output token (decode speed); latency = total time for a response; throughput = tokens (or requests) served per second across all users.

latency ≈ TTFT + TPOT × (number of output tokens)

A chatbot cares most about TTFT and TPOT (perceived responsiveness); an overnight batch job cares about throughput (total tokens/sec, latency irrelevant). The metric you optimize depends on the product.

Exercise 6Derive
Per-group int8 quantization storage; 7B memory at fp16/int8/int4; expected decode speedup.

Solution

Per-group int8 stores each weight as an 8-bit integer plus a shared fp16 scale (and zero-point) per group of weights, so the dominant cost is ~1 byte/weight. A 7B model: fp16 = 14 GB, int8 ≈ 7 GB, int4 ≈ 3.5 GB. Since decode is memory-bandwidth-bound, halving (int8) or quartering (int4) the bytes read per token gives roughly a proportional decode speedup (up to ~2× / ~4×), minus dequantization overhead.

Exercise 7Pen & Paper
Explain GPTQ and AWQ; how do they differ? Why does activation-aware quantization preserve quality?

Solution

GPTQ quantizes weights layer-by-layer, using second-order (Hessian) information to minimize the output error introduced, adjusting remaining weights to compensate. AWQ is activation-aware: it identifies the small fraction of weight channels that interact with large-magnitude activations (the salient ones) and protects them (scaling so they quantize with less error). Activation-aware quantization preserves quality because not all weights matter equally — the ones multiplied by large activations dominate the output, so protecting them prevents the worst errors, unlike uniform quantization that treats all weights the same.

Exercise 8Pen & Paper
GPTQ/AWQ vs GGUF; datacenter GPU serving vs MacBook — which and why?

Solution

GPTQ/AWQ are GPU-oriented quantization formats optimized for fast GPU inference (used by vLLM, TensorRT). GGUF (llama.cpp) is a CPU/Apple-Silicon-friendly format supporting flexible quantization for local/consumer hardware. For datacenter GPU serving, use GPTQ or AWQ (best GPU throughput); for running on a MacBook, use GGUF with llama.cpp (runs well on CPU/Metal, no datacenter GPU needed). Match the format to the deployment hardware.

Exercise 9Pen & Paper
Explain KV-cache fragmentation; how does PagedAttention solve it (OS paging analogy)?

Solution

Pre-allocating a contiguous KV cache for each request's MAX length wastes memory when requests are shorter (internal fragmentation) and prevents fitting more requests. PagedAttention borrows OS virtual memory: it splits the KV cache into fixed-size BLOCKS allocated on demand, with a per-request block table mapping logical positions to physical blocks. Like pages in RAM, blocks need not be contiguous and are allocated only as the sequence grows — eliminating fragmentation and letting many more requests share GPU memory.

Exercise 10Pen & Paper
Static vs continuous batching; why does continuous batching keep the GPU fuller? What does it require?

Solution

Static batching groups a fixed set of requests and waits for ALL to finish before starting the next batch — so finished requests leave the GPU idle while waiting for the slowest. Continuous batching adds new requests and evicts finished ones at the token level, every step, so freed slots are immediately refilled, keeping the GPU full. It requires a scheduler that manages per-request state (KV cache, position) and dynamically composes each step's batch — the core of vLLM-style serving.

Exercise 11Derive
Why is speculative decoding output-identical? Derive expected accepted tokens per pass vs α and draft length k.

Solution

A small draft model proposes k tokens; the large model verifies them in one parallel pass and accepts each with a probability that, via the rejection-sampling correction, makes the final distribution EXACTLY the large model's — so output is identical in distribution to normal decoding. Expected accepted tokens before the first rejection out of k drafts is Σ_{i=1}^{k} α^i = α(1−α^k)/(1−α), plus one 'bonus' token always sampled from the target:

E[tokens per pass] = 1 + α(1−α^k)/(1−α)

Higher acceptance α and longer drafts k yield more tokens per expensive target pass — the speedup, free of quality loss.

Exercise 12Code
Implement the KV cache; measure cached vs uncached speedup as sequence grows.

Solution

Caching past keys/values makes each new token cost O(T) instead of recomputing O(T²); the measured speedup grows with sequence length and is large for long generations — the single most important inference optimization (Chapter 12/13 in the serving context).

Exercise 13Code
Implement per-group int8 and int4 weight quantization; measure memory saved and error.

Solution

Quantizing weights to int8/int4 with per-group scales (Exercise 6) reduces memory ~2×/4×. Measuring the reconstruction error shows int8 is nearly lossless while int4 introduces small but visible error — quantifying the memory/quality trade-off explored in Exercise 14.

Exercise 14Code
Measure quantization quality: fp16 vs int4 on a benchmark; report quality drop and gains.

Solution

Running the model at fp16 and int4 on a benchmark typically shows a small quality drop (a few points) for int4 in exchange for ~4× memory savings and faster decode — the empirical case that 4-bit quantization is usually worth it for serving, especially with quality-preserving methods (AWQ/GPTQ).

Exercise 15Code
Implement temperature, top-k, top-p sampling; show how each changes diversity/quality.

Solution

Temperature flattens/sharpens the distribution; top-k restricts to the k most likely; top-p (nucleus) keeps the smallest set summing to p. Generating text under each shows higher temperature/k/p increasing diversity at some coherence cost — the decoding knobs of Chapter 3 in the serving loop.

Exercise 16Code
Measure TTFT and TPOT separately; vary prompt length and observe TTFT.

Solution

Instrumenting the loop to time the first token (TTFT) separately from subsequent tokens (TPOT) shows TTFT growing with prompt length (more prefill compute) while TPOT stays roughly constant (per-token decode) — confirming the prefill/decode distinction of Exercise 2.

Exercise 17Code Lab
Simulate static vs continuous batching; measure GPU utilization and throughput.

Solution

Modeling a stream of variable-length requests shows static batching leaving the GPU idle as short requests finish and wait for long ones, while continuous batching refills freed slots immediately — yielding markedly higher utilization and throughput (Exercise 10).

Exercise 18Code Lab
Implement a simplified PagedAttention KV cache; measure memory waste vs contiguous.

Solution

Storing the cache in fixed-size blocks allocated on demand with a block table (Exercise 9) and comparing to max-length contiguous allocation shows far less wasted memory (no over-allocation for short sequences), letting more requests fit — the memory efficiency behind vLLM.

Exercise 19Code Lab
Implement speculative decoding with a draft + large model; measure acceptance and speedup; verify identical output.

Solution

Drafting k tokens with a small model and verifying with the large model gives a speedup determined by the acceptance rate (Exercise 11), while the rejection-sampling correction makes the output distribution identical to plain decoding — verified by matching samples at the same seed.

Exercise 20Code
Serve a model with vLLM; measure throughput; enable AWQ and compare memory/throughput.

Solution

vLLM's paged KV cache and continuous batching give high throughput out of the box; switching to an AWQ-quantized model further reduces memory and can increase throughput (smaller weights, more room for batching) at a small quality cost — the production stack of Exercises 7–10 in one tool.

Exercise 21Code
Implement prefix caching; measure TTFT reduction for a fixed long system prompt.

Solution

Detecting a shared prompt prefix (e.g. a long system prompt repeated across requests) and reusing its cached KV avoids re-prefilling it each time, sharply reducing TTFT for those requests — a large win when many requests share a common preamble.

Exercise 22Code (Challenge)
Build a mini inference server combining paged KV cache, continuous batching, int4 weights, sampling; benchmark vs naive fp16.

Solution

Layering the optimizations and adding each in turn shows their cumulative effect: continuous batching and paging raise throughput and memory efficiency, int4 cuts memory and speeds decode, and the combination dramatically outperforms a naive one-request-at-a-time fp16 baseline in both throughput and latency — the integrated payoff of the chapter, with each optimization's contribution isolated.