Solutions Appendix
Chapter 31

Serving at Scale

20 Solutions

Detailed solutions for the exercises in Chapter 31. Try solving them yourself before checking the answers.

Exercise 1 (Pen & Paper)
Why is serving at scale a systems problem, not a model problem? Name five things that change from one request to millions.

Solution

The model is fixed; the challenge is the infrastructure around it. From one request to millions, you must add: (1) load balancing across many replicas; (2) autoscaling to match fluctuating demand; (3) queuing and batching to use GPUs efficiently; (4) monitoring of tail latency and cost; (5) fault tolerance / failover and safe rollouts. None of these touch the model weights — they are distributed-systems concerns. Serving is about reliability, latency, and cost at scale, not model quality.

Exercise 2 (Pen & Paper)
Why is streaming essential for interactive products? Connect to TTFT and perceived responsiveness.

Solution

Streaming sends tokens as they are generated rather than waiting for the full response, so the user sees output starting after just the TTFT (time to first token, Chapter 27) instead of after the entire generation. Since TTFT is short but total generation can take many seconds, streaming dramatically improves PERCEIVED responsiveness — the user reads along as the model writes. Without streaming, even a fast model feels slow because nothing appears until it's done.

Exercise 3 (Pen & Paper)
Why do stateless servers scale better than stateful ones? What is the trade-off, and how does prefix caching address it?

Solution

Stateless servers hold no per-conversation state, so any replica can handle any request: trivial to load-balance, scale, and recover (a failed replica loses nothing). The trade-off: the conversation history must be re-sent each turn and its KV cache recomputed (a fresh prefill), wasting compute. Prefix caching addresses this by caching the KV for shared/repeated prefixes (e.g. a conversation's history or system prompt) and routing follow-ups to where the cache lives. This avoids recomputation while staying largely stateless, because a cache miss costs only a recomputed prefill, never correctness.

Exercise 4 (Pen & Paper)
Why does round-robin load balancing work poorly for LLMs? Describe an LLM-aware strategy.

Solution

Round-robin assumes requests are equal-cost, but LLM requests vary enormously (a 10-token vs a 4000-token generation), so round-robin can pile long requests on one replica while another idles — terrible tail latency. An LLM-aware balancer routes by actual LOAD — e.g. to the replica with the fewest active tokens, lowest queue depth, or most free KV-cache capacity — spreading work by real cost rather than request count, which evens out utilization and tail latency.

Exercise 5 (Pen & Paper)
Explain the cold-start problem in autoscaling; why does it force predictive scaling?

Solution

Spinning up a new LLM replica is slow — loading tens of gigabytes of weights onto a GPU takes seconds to minutes (the cold start). Purely reactive scaling (add capacity when load spikes) is too late: by the time the replica is ready, the spike has already caused dropped requests and SLA violations. So you must scale PREDICTIVELY — anticipate demand (from trends, schedules, leading indicators) and pre-warm replicas before the load arrives — to hide the cold-start delay.

Exercise 6 (Pen & Paper)
Why are SLAs stated in percentiles (p95/p99), not averages? Give an example where the average looks fine but the tail is terrible.

Solution

Averages hide the experience of the unlucky tail of users, who feel the worst latency. Percentiles (p95/p99) bound the experience for nearly all users. Example: 95 requests at 100 ms and 5 requests at 10,000 ms give a mean of 595 ms (looks acceptable) and a median of 100 ms, but a p99 of 10 s (terrible: 5% of users wait 10 seconds). SLAs use percentiles because a good average can coexist with an unacceptable tail, and users remember the worst experiences.

Exercise 7 (Pen & Paper)
Describe cache-aware routing and model routing; how does each improve latency or cost?

Solution

Cache-aware routing sends a request to the replica that already holds its relevant KV cache (e.g. the same conversation's prefix), avoiding recomputation and reducing TTFT — a latency win. Model routing sends easy queries to a small/cheap model and hard ones to a large/expensive model (via a classifier or heuristic), so most traffic is served cheaply and only hard queries pay for the big model — a cost win with little quality loss. Both route intelligently based on request properties.

Exercise 8 (Pen & Paper)
Compare canary, blue-green, shadow rollouts; why must you always be able to roll back?

Solution

Canary: send a small % of traffic to the new model, watch metrics, ramp up if healthy. Blue-green: run old (blue) and new (green) in parallel, switch all traffic at once (instant cutover, easy rollback). Shadow: send a copy of real traffic to the new model WITHOUT serving its responses, to compare safely. You must always be able to roll back because a new model can regress in subtle ways (quality, latency, cost, safety) only visible under real traffic — instant rollback limits the blast radius of a bad deploy.

Exercise 9 (Pen & Paper)
Explain A/B testing for models; why is it the gold standard over offline benchmarks; how do they complement?

Solution

A/B testing splits live users between model A and B and measures real outcome metrics (engagement, task success, satisfaction). It is the gold standard because it measures actual user impact under real conditions, which offline benchmarks (static, possibly contaminated, narrow) cannot fully capture. They complement: offline benchmarks cheaply screen candidates before deployment (catch regressions early), and A/B tests provide the ground-truth verdict on the survivors. Benchmarks filter; A/B decides.

Exercise 10 (Pen & Paper)
How do naive retries cause a retry storm; how does exponential backoff with jitter prevent it?

Solution

When a service is overloaded, naive immediate retries pile MORE requests onto the struggling service, worsening the overload — a self-amplifying retry storm that can collapse the system. Exponential backoff waits progressively longer between retries (1s, 2s, 4s...), reducing the retry rate so the service can recover; jitter (randomizing the wait) prevents all clients from retrying in synchronized waves. Together they spread retries out in time, breaking the storm.

Exercise 11 (Code)
Build a streaming API endpoint (server-sent events); measure perceived-latency improvement.

Solution

Sending tokens as SSE as they're generated lets the client display output starting at TTFT rather than after full generation — measuring the time-to-first-visible-token shows a large perceived-latency improvement over returning the complete response (Exercise 2).
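A minimal sketch of the measurement, assuming a stand-in token generator in place of a real model. `fake_generate`, its delay parameters, and the `[DONE]` end marker are illustrative assumptions, and the SSE framing is shown without an HTTP server around it.

```python
import time
from typing import Iterator, List, Tuple


def fake_generate(n_tokens: int, ttft_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical model stand-in: waits ttft_s, then yields one token per per_token_s."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


def sse_events(tokens: Iterator[str]) -> Iterator[str]:
    """Frame each token as a server-sent event: a 'data:' line followed by a blank line."""
    for tok in tokens:
        yield f"data: {tok}\n\n"
    # End-of-stream marker: a common convention, not part of the SSE format itself.
    yield "data: [DONE]\n\n"


def measure(n_tokens: int = 20, ttft_s: float = 0.02,
            per_token_s: float = 0.005) -> Tuple[float, float, List[str]]:
    """Return (time to first visible event, time to full response, all events)."""
    start = time.perf_counter()
    first_visible = None
    events = []
    for event in sse_events(fake_generate(n_tokens, ttft_s, per_token_s)):
        if first_visible is None:
            first_visible = time.perf_counter() - start
        events.append(event)
    total = time.perf_counter() - start
    return first_visible, total, events


first, total, events = measure()
print(f"streaming: first output after {first * 1000:.0f} ms")
print(f"non-streaming: first output after {total * 1000:.0f} ms")
```

A non-streaming endpoint would show nothing until `total` has elapsed, so the gap between the two printed numbers is the perceived-latency improvement.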

Exercise 12 (Code)
Implement an LLM-aware load balancer (route to lowest load); compare tail latency to round-robin.

Solution

Routing to the replica with the lowest active load (queue depth / active tokens) versus round-robin, under variable request sizes, shows the LLM-aware balancer achieving much better p95/p99 tail latency (Exercise 4) by avoiding piling long requests on one replica.
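The comparison can be sketched with a deliberately simplified model: all requests arrive at once, each replica works through its queue one request at a time, and cost is in abstract units (think tokens to generate). These simplifications and the example workload are assumptions for illustration.

```python
from typing import List


def simulate(costs: List[float], n_replicas: int, policy: str) -> List[float]:
    """Return the completion latency of each request under the given routing policy."""
    outstanding = [0.0] * n_replicas  # work already queued on each replica
    latencies = []
    for i, cost in enumerate(costs):
        if policy == "round_robin":
            r = i % n_replicas
        elif policy == "least_loaded":
            # Route by actual load: the replica with the least queued work.
            r = min(range(n_replicas), key=lambda k: outstanding[k])
        else:
            raise ValueError(f"unknown policy: {policy}")
        outstanding[r] += cost
        latencies.append(outstanding[r])  # request finishes after everything ahead of it
    return latencies


# Alternating long and short generations: the worst case for round-robin on 2 replicas.
costs = [10, 1, 10, 1, 10, 1]
rr = simulate(costs, 2, "round_robin")
ll = simulate(costs, 2, "least_loaded")
print("round-robin  worst latency:", max(rr))  # 30
print("least-loaded worst latency:", max(ll))  # 21
```

In this toy run both policies have the same mean latency (11 units) while the worst case falls from 30 to 21, which is the tail effect the exercise asks you to measure. A fuller simulation would add staggered arrivals and report p95/p99 over many requests.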

Exercise 13 (Code Lab)
Simulate autoscaling with a cold-start delay; compare reactive vs predictive scaling.

Solution

Modeling a fluctuating request stream and a scale-up cold-start lag shows reactive scaling dropping requests or violating latency targets during spikes (capacity arrives too late), while predictive scaling pre-warms replicas and rides out the spike. This demonstrates why cold starts force prediction (Exercise 5), at the cost of running some spare capacity.
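One way to set up the simulation is a discrete-time loop, sketched below. The assumptions are loud ones: the "predictive" scaler gets a perfect forecast `cold_start` ticks ahead (an idealization), replicas only scale up, and requests over capacity are dropped rather than queued.

```python
import math
from typing import List, Tuple


def simulate(demand: List[int], per_replica: int, cold_start: int,
             lookahead: int) -> Tuple[int, int]:
    """Return (dropped requests, replica-ticks paid for).

    lookahead=0 sizes the fleet for current demand (reactive);
    lookahead=cold_start sizes it for demand one cold start ahead (predictive).
    """
    ready = 1                  # replicas currently serving
    pending: List[int] = []    # ticks at which ordered replicas finish warming up
    dropped = 0
    replica_ticks = 0
    for t in range(len(demand)):
        # Replicas whose cold start has finished come online.
        ready += sum(1 for r in pending if r <= t)
        pending = [r for r in pending if r > t]
        # Order replicas for the demand we are sizing against.
        horizon = min(t + lookahead, len(demand) - 1)
        target = math.ceil(demand[horizon] / per_replica)
        missing = target - (ready + len(pending))
        pending.extend([t + cold_start] * max(0, missing))
        # Serve this tick; anything over current capacity is dropped.
        dropped += max(0, demand[t] - ready * per_replica)
        replica_ticks += ready
    return dropped, replica_ticks


demand = [10] * 5 + [40] * 5 + [10] * 5   # a 4x spike in the middle
print("reactive  :", simulate(demand, per_replica=10, cold_start=3, lookahead=0))
print("predictive:", simulate(demand, per_replica=10, cold_start=3, lookahead=3))
```

With this workload the reactive scaler drops 90 requests during the three ticks it spends waiting on cold starts, while the predictive one drops none but pays for 45 replica-ticks instead of 36.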

Exercise 14 (Code)
Compute and plot p50/p95/p99 latency from request timings; show the average hides the tail.

Solution

Computing the percentiles from a timing stream and plotting them alongside the mean shows a low average masking a high p99 (Exercise 6) — making concrete why SLAs are written in percentiles, not averages.
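A small sketch of the computation, reusing the numbers from Exercise 6 and assuming the nearest-rank percentile definition (other definitions interpolate and give slightly different values). Plotting is left out to keep the example dependency-free.

```python
import statistics
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


# 95 fast requests and 5 very slow ones, as in Exercise 6.
latencies_ms = [100] * 95 + [10_000] * 5

mean = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(f"mean={mean} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

Under the nearest-rank definition p95 is still 100 ms here, since exactly 95 of the 100 samples are fast; only p99 exposes the slow 5%, while the mean of 595 ms describes no request that actually occurred.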

Exercise 15 (Code)
Implement cache-aware routing; route conversation follow-ups to the replica holding the cached prefix; measure TTFT improvement.

Solution

Routing follow-up turns to the replica that already cached the conversation's prefix avoids re-prefilling the history, measurably reducing TTFT versus routing blindly (Exercise 7) — the latency benefit of cache-aware routing.
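The effect can be sketched by counting prefill tokens, used here as a proxy for TTFT. The model below assumes each turn's prompt extends the previous one unchanged, caches never evict, and a new conversation is pinned to a replica in arrival order; all of these are simplifying assumptions.

```python
from typing import Dict, List, Tuple


def run(requests: List[Tuple[str, int]], n_replicas: int, cache_aware: bool) -> List[int]:
    """Return the number of tokens that must be prefilled for each request.

    requests: (conversation_id, total_prompt_tokens) in arrival order.
    """
    caches: List[Dict[str, int]] = [dict() for _ in range(n_replicas)]  # conv -> cached prefix length
    home: Dict[str, int] = {}  # router table: conv -> replica holding its cache
    prefill = []
    for i, (conv, prompt_len) in enumerate(requests):
        if cache_aware:
            # Sticky routing: follow-ups go back to the replica that saw the conversation first.
            r = home.setdefault(conv, len(home) % n_replicas)
        else:
            r = i % n_replicas  # blind round-robin
        cached = caches[r].get(conv, 0)
        prefill.append(prompt_len - cached)  # only the uncached suffix is recomputed
        caches[r][conv] = prompt_len
    return prefill


# Two conversations, three turns each, every turn adding 100 tokens of history.
requests = [("A", 100), ("B", 100), ("B", 200), ("A", 200), ("A", 300), ("B", 300)]
print("blind      :", sum(run(requests, 2, cache_aware=False)), "prefill tokens")
print("cache-aware:", sum(run(requests, 2, cache_aware=True)), "prefill tokens")
```

Here blind routing prefills 1000 tokens in total against 600 with sticky routing. A production router would also weigh replica load, since pinning every follow-up to one replica can create hot spots.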

Exercise 16 (Code)
Implement model routing: easy queries → small model, hard → large; measure cost savings and quality.

Solution

A classifier/heuristic directing easy queries to a small model and hard ones to a large model serves most traffic cheaply, cutting average cost substantially with little overall quality loss (Exercise 7) — demonstrating the cost lever of routing by difficulty.
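A toy version of the experiment, where every number is made up for illustration: the per-query costs, the accuracies, and the word-count heuristic standing in for a trained difficulty classifier are all assumptions, not measurements.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_query: float  # hypothetical price units
    acc_easy: float        # hypothetical accuracy on easy queries
    acc_hard: float        # hypothetical accuracy on hard queries


SMALL = Model("small", 1.0, 0.95, 0.50)
LARGE = Model("large", 10.0, 0.97, 0.90)


def is_hard(query: str) -> bool:
    """Toy heuristic: long or explicitly multi-step prompts count as hard."""
    return len(query.split()) > 12 or "step by step" in query.lower()


def route(query: str) -> Model:
    return LARGE if is_hard(query) else SMALL


def evaluate(workload: List[Tuple[str, bool]],
             policy: Callable[[str], Model]) -> Tuple[float, float]:
    """workload: (query, truly_hard) pairs. Returns (total cost, expected accuracy)."""
    cost, acc = 0.0, 0.0
    for query, truly_hard in workload:
        model = policy(query)
        cost += model.cost_per_query
        acc += model.acc_hard if truly_hard else model.acc_easy
    return cost, acc / len(workload)


easy = ["What is 2 + 2?", "Capital of France?",
        "Translate 'hello' to Spanish.", "Define latency."] * 2
hard = ["Prove step by step that the sum of two even numbers is even.",
        "Design a sharded rate limiter, compare three algorithms, and analyze "
        "the failure modes of each under network partitions."]
workload = [(q, False) for q in easy] + [(q, True) for q in hard]

print("routed   :", evaluate(workload, route))
print("all-large:", evaluate(workload, lambda q: LARGE))
```

With these assumed numbers, routing costs 28 units instead of 100 while expected accuracy moves from about 0.956 to about 0.94. The real exercise should also count misrouted hard queries, which this toy heuristic happens to avoid.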

Exercise 17 (Code Lab)
Simulate an A/B test: assign users to A/B, generate outcomes with a real difference, run a significance test.

Solution

Splitting simulated users, generating metrics with a planted difference, and running a significance test (e.g. t-test) shows whether B's improvement is statistically real or noise — the methodology of Exercise 9 for deciding deployments on evidence, not vibes.
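A sketch of the simulation using only the standard library. The outcome is modeled as a binary task success, and because of that a two-proportion z-test is swapped in for the t-test mentioned above; the success rates and sample size are assumed values.

```python
import math
import random
from typing import Dict, List, Tuple


def simulate_ab(n_users: int, rate_a: float, rate_b: float, seed: int = 0) -> Dict[str, List[int]]:
    """Randomly assign users to A or B and draw a success/failure outcome for each."""
    rng = random.Random(seed)
    counts = {"A": [0, 0], "B": [0, 0]}  # arm -> [successes, users]
    for _ in range(n_users):
        arm = "A" if rng.random() < 0.5 else "B"
        rate = rate_a if arm == "A" else rate_b
        counts[arm][0] += int(rng.random() < rate)
        counts[arm][1] += 1
    return counts


def two_proportion_z_test(s_a: int, n_a: int, s_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates, B minus A."""
    p_a, p_b = s_a / n_a, s_b / n_b
    pooled = (s_a + s_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


counts = simulate_ab(n_users=10_000, rate_a=0.50, rate_b=0.60)  # planted 10-point lift
(s_a, n_a), (s_b, n_b) = counts["A"], counts["B"]
z, p = two_proportion_z_test(s_a, n_a, s_b, n_b)
print(f"A: {s_a}/{n_a}  B: {s_b}/{n_b}  z={z:.2f}  p={p:.3g}")
```

Rerunning with `rate_b=0.50` (an A/A test) or with a much smaller `n_users` is a useful follow-up: it shows how often noise alone produces an apparent winner and why sample size matters.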

Exercise 18 (Code)
Implement retries with exponential backoff and jitter; show naive retries storm while backoff recovers.

Solution

Simulating a struggling service, naive immediate retries amplify the overload (the storm of Exercise 10) while exponential backoff with jitter throttles and de-synchronizes retries, letting the service recover — a direct demonstration of the resilience pattern.
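The client side of the pattern can be sketched as below, using the "full jitter" variant (wait a uniformly random time up to the exponential bound). The function names, the defaults, and the injectable `sleep` and `rng` parameters are choices made for this sketch.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0, rng=random) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep, rng=random) -> T:
    """Call fn, retrying on failure with exponentially growing, jittered waits."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # real code should catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
    raise ValueError("max_attempts must be at least 1")


# Demo: a service that fails twice, then recovers. Sleeps are recorded, not performed.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("overloaded")
    return "ok"


waits = []
result = call_with_retries(flaky, sleep=waits.append, rng=random.Random(0))
print(result, [round(w, 2) for w in waits])
```

To reproduce the storm itself, run many such clients against a capacity-limited simulated service and compare requests per second with `backoff_delay` replaced by a constant zero.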

Exercise 19 (Code)
Build a cost dashboard: track tokens served and cost per token across replicas by request type; find the cost driver.

Solution

Aggregating token counts and cost by request type and replica surfaces the biggest cost driver (often long-output or large-model requests) — the visibility needed to target optimizations (e.g. model routing) where they save the most, the FinOps side of serving.
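The aggregation behind such a dashboard can be sketched as follows. The record fields, request types, and prices are hypothetical; a real system would read them from request logs and a pricing table.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize(records: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Aggregate tokens and cost per (request_type, replica)."""
    by_key = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
    for r in records:
        key = (r["request_type"], r["replica"])
        by_key[key]["tokens"] += r["tokens"]
        by_key[key]["cost"] += r["tokens"] / 1000 * r["price_per_1k"]
    return dict(by_key)


def top_cost_driver(summary: Dict[Tuple[str, str], dict]) -> Tuple[str, float]:
    """Return the request type with the highest total cost across replicas."""
    by_type = defaultdict(float)
    for (req_type, _replica), agg in summary.items():
        by_type[req_type] += agg["cost"]
    return max(by_type.items(), key=lambda kv: kv[1])


# Hypothetical log records: price_per_1k differs because codegen uses a larger model.
records = [
    {"replica": "r0", "request_type": "chat", "tokens": 500, "price_per_1k": 0.5},
    {"replica": "r1", "request_type": "chat", "tokens": 700, "price_per_1k": 0.5},
    {"replica": "r0", "request_type": "summarize", "tokens": 4000, "price_per_1k": 0.5},
    {"replica": "r1", "request_type": "codegen", "tokens": 2000, "price_per_1k": 5.0},
]

summary = summarize(records)
for (req_type, replica), agg in sorted(summary.items()):
    print(f"{req_type:10s} {replica}  tokens={agg['tokens']:5d}  cost={agg['cost']:.2f}")
print("top cost driver:", top_cost_driver(summary))
```

In this made-up data the codegen traffic serves fewer tokens than summarization yet costs five times as much, which is the kind of finding that points toward model routing as the first optimization.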

Exercise 20 (Code, Challenge)
Build a mini serving system tying Part VI together (streaming gateway, LLM-aware LB, continuous batching, routing, autoscaling, p99/cost monitoring, canary); drive with a fluctuating workload, fail a replica, show failover.

Solution

The capstone integrates the chapter: a streaming, rate-limited gateway over load-balanced replicas (each using Chapter-27 continuous batching), with cache/model routing, predictive autoscaling, p99-latency and cost monitoring, and a canary rollout. Driving it with a realistic workload shows latency, throughput, and cost responding to each mechanism. Deliberately failing a replica and observing that failover keeps the system within its SLA demonstrates fault tolerance. This is the full production-serving picture, and the close of Part VI.
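The full system is too large to show here, but the failover step can be sketched in isolation. This is a minimal illustration under stated assumptions: replicas are in-process objects, a failure surfaces as a `ConnectionError`, and the balancer never re-checks a replica it has marked down.

```python
from typing import List, Set


class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.served = 0

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.served += 1
        return f"{self.name}:{request}"


class FailoverBalancer:
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = list(replicas)
        self.down: Set[str] = set()  # replicas the balancer currently believes are unhealthy

    def dispatch(self, request: str) -> str:
        candidates = [r for r in self.replicas if r.name not in self.down]
        # Try the least-loaded replica first; on failure, mark it down and try the next.
        for replica in sorted(candidates, key=lambda r: r.served):
            try:
                return replica.handle(request)
            except ConnectionError:
                self.down.add(replica.name)
        raise RuntimeError("no healthy replicas available")


replicas = [Replica("r0"), Replica("r1"), Replica("r2")]
balancer = FailoverBalancer(replicas)

before = [balancer.dispatch(f"req{i}") for i in range(6)]
replicas[1].healthy = False  # deliberately fail one replica
after = [balancer.dispatch(f"req{i}") for i in range(6, 10)]

print("served:", {r.name: r.served for r in replicas})
print("marked down:", balancer.down)
```

Every request after the failure still gets a response, with the load shifting to the two surviving replicas. The capstone should additionally check that p99 latency stays within the SLA while capacity is reduced, and add a health check that returns recovered replicas to service.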