Serving at Scale
Detailed solutions for the exercises in Chapter 31. Try solving them yourself before checking the answers.
Solution
The model is fixed; the challenge is the infrastructure around it. From one request to millions, you must add: (1) load balancing across many replicas; (2) autoscaling to match fluctuating demand; (3) queuing and batching to use GPUs efficiently; (4) monitoring of tail latency and cost; (5) fault tolerance / failover and safe rollouts. None of these touch the model weights — they are distributed-systems concerns. Serving is about reliability, latency, and cost at scale, not model quality.
Solution
Streaming sends tokens as they are generated rather than waiting for the full response, so the user sees output starting after just the TTFT (time to first token, Chapter 27) instead of after the entire generation. Since TTFT is short but total generation can take many seconds, streaming dramatically improves PERCEIVED responsiveness — the user reads along as the model writes. Without streaming, even a fast model feels slow because nothing appears until it's done.
Solution
Stateless servers hold no per-conversation state, so any replica can handle any request: they are trivial to load-balance, scale, and recover (a failed replica loses nothing). The trade-off is that the client must re-send the full conversation history each turn and the server must re-process it (recomputing prefill to rebuild the KV cache), which wastes compute. Prefix caching addresses this by caching the KV for shared/repeated prefixes (e.g. a conversation's history or system prompt) and routing follow-ups to where the cache lives, avoiding recomputation while staying largely stateless.
Solution
Round-robin assumes requests are equal-cost, but LLM requests vary enormously (a 10-token vs a 4000-token generation), so round-robin can pile long requests on one replica while another idles — terrible tail latency. An LLM-aware balancer routes by actual LOAD — e.g. to the replica with the fewest active tokens, lowest queue depth, or most free KV-cache capacity — spreading work by real cost rather than request count, which evens out utilization and tail latency.
Solution
Spinning up a new LLM replica is slow — loading tens of gigabytes of weights onto a GPU takes seconds to minutes (the cold start). Purely reactive scaling (add capacity when load spikes) is too late: by the time the replica is ready, the spike has already caused dropped requests and SLA violations. So you must scale PREDICTIVELY — anticipate demand (from trends, schedules, leading indicators) and pre-warm replicas before the load arrives — to hide the cold-start delay.
Solution
Averages hide the experience of the unlucky tail of users, who feel the worst latency. Percentiles (p95/p99) bound the experience for nearly all users. Example: 95 requests at 100 ms and 5 requests at 10,000 ms give a mean of 595 ms (looks acceptable) but a p99 of 10 s (terrible: 5% of users wait 10 seconds). SLAs use percentiles because a good average can coexist with an unacceptable tail, and users remember the worst experiences.
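The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the nearest-rank definition of a percentile (other definitions interpolate and can give slightly different values); the helper name nearest_rank_percentile is illustrative, not from the chapter.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Smallest sample with at least pct% of all samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# The workload from the example: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [10_000] * 5

mean_ms = statistics.mean(latencies_ms)
p95_ms = nearest_rank_percentile(latencies_ms, 95)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

Under this definition the mean is 595 ms and the p95 is still 100 ms, while the p99 is 10,000 ms: only the high percentile exposes the five slow requests.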
Solution
Cache-aware routing sends a request to the replica that already holds its relevant KV cache (e.g. the same conversation's prefix), avoiding recomputation and reducing TTFT — a latency win. Model routing sends easy queries to a small/cheap model and hard ones to a large/expensive model (via a classifier or heuristic), so most traffic is served cheaply and only hard queries pay for the big model — a cost win with little quality loss. Both route intelligently based on request properties.
Solution
Canary: send a small % of traffic to the new model, watch metrics, ramp up if healthy. Blue-green: run old (blue) and new (green) in parallel, switch all traffic at once (instant cutover, easy rollback). Shadow: send a copy of real traffic to the new model WITHOUT serving its responses, to compare safely. You must always be able to roll back because a new model can regress in subtle ways (quality, latency, cost, safety) only visible under real traffic — instant rollback limits the blast radius of a bad deploy.
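One common way to implement the canary split is to hash a stable identifier into a bucket, so each user consistently sees the same model while the percentage ramps. The sketch below assumes a string user ID and 100 buckets; in_canary and its parameters are hypothetical names.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    Buckets below canary_percent go to the new model. Because the bucket
    depends only on the user ID, raising the percentage only adds users to
    the canary, and setting it to 0 is an immediate rollback.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


def pick_model(user_id: str, canary_percent: int) -> str:
    return "new-model" if in_canary(user_id, canary_percent) else "old-model"
```

Ramping from 1 to 5 to 25 percent is then a configuration change rather than a redeploy, which is what makes rollback fast.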
Solution
A/B testing splits live users between model A and B and measures real outcome metrics (engagement, task success, satisfaction). It is the gold standard because it measures actual user impact under real conditions, which offline benchmarks (static, possibly contaminated, narrow) cannot fully capture. They complement: offline benchmarks cheaply screen candidates before deployment (catch regressions early), and A/B tests provide the ground-truth verdict on the survivors. Benchmarks filter; A/B decides.
Solution
When a service is overloaded, naive immediate retries pile MORE requests onto the struggling service, worsening the overload — a self-amplifying retry storm that can collapse the system. Exponential backoff waits progressively longer between retries (1s, 2s, 4s...), reducing the retry rate so the service can recover; jitter (randomizing the wait) prevents all clients from retrying in synchronized waves. Together they spread retries out in time, breaking the storm.
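A client-side sketch of the pattern follows. It assumes the variant in which each wait is drawn uniformly between zero and the exponential ceiling (often called "full jitter"); the base delay, cap, and function names are illustrative choices, not values from the chapter.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Random wait in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random):
    """Call operation(), retrying failures with exponential backoff and jitter.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, rng=rng))
```

The sleep and rng parameters are injectable so the behavior can be tested without actually waiting. A production client would typically retry only on errors known to be transient.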
Solution
Sending tokens as Server-Sent Events (SSE) as they're generated lets the client display output starting at TTFT rather than after full generation. Measuring the time-to-first-visible-token shows a large perceived-latency improvement over returning the complete response (Exercise 2).
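The wire format is simple enough to sketch without a web framework: each event is a "data:" line followed by a blank line. The generator below assumes tokens arrive from some iterable and wraps each one in JSON so that newlines inside a token cannot break the framing; the final "[DONE]" sentinel is an assumed convention, not part of the SSE format itself.

```python
import json


def sse_events(token_stream):
    """Turn an iterable of tokens into Server-Sent Events text chunks."""
    for token in token_stream:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Assumed end-of-stream marker so the client knows generation finished.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for chunk in sse_events(["Hel", "lo", "!"]):
        print(chunk, end="")
```

A server would write each chunk to the response as soon as it is produced, with the Content-Type header set to text/event-stream, so the first chunk reaches the client at roughly TTFT.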
Solution
Routing to the replica with the lowest active load (queue depth / active tokens) versus round-robin, under variable request sizes, shows the LLM-aware balancer achieving much better p95/p99 tail latency (Exercise 4) by avoiding piling long requests on one replica.
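A toy version of this comparison is sketched below. It makes strong simplifying assumptions: all requests arrive in one burst, each replica works through its queue in order, and the cost of each request (in tokens) is known when it is routed, whereas a real balancer would have to estimate it. All names are illustrative.

```python
def simulate(costs, num_replicas, pick_replica):
    """Assign each request to a replica; return per-request finish times."""
    backlog = [0] * num_replicas  # outstanding work per replica, in tokens
    finish_times = []
    for index, cost in enumerate(costs):
        replica = pick_replica(index, backlog)
        backlog[replica] += cost
        finish_times.append(backlog[replica])
    return finish_times


def round_robin(index, backlog):
    return index % len(backlog)


def least_loaded(index, backlog):
    return min(range(len(backlog)), key=backlog.__getitem__)


# One long generation followed by three short ones, repeated five times.
costs = [4000, 10, 10, 10] * 5

rr_latencies = simulate(costs, 4, round_robin)
ll_latencies = simulate(costs, 4, least_loaded)

print("worst case, round-robin: ", max(rr_latencies))
print("worst case, least-loaded:", max(ll_latencies))
```

With this workload round-robin sends every long request to the same replica, so its worst-case finish time is far higher than under least-loaded routing even though both policies handle the same number of requests per replica.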
Solution
Modeling a fluctuating request stream and a scale-up cold-start lag shows reactive scaling dropping requests / violating latency during spikes (capacity arrives too late), while predictive scaling pre-warms replicas and rides out the spike. This demonstrates why cold starts force prediction (Exercise 5), at the cost of running some spare capacity.
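The effect can be reproduced with a small discrete-time model. This sketch assumes a fixed cold-start lag measured in time steps, a perfect demand forecast for the predictive case, and no scale-down; simulate_scaling and its parameters are hypothetical names.

```python
import math


def simulate_scaling(demand, per_replica, cold_start, lookahead):
    """Return the total number of dropped requests.

    lookahead=0 sizes the fleet for current demand (reactive);
    lookahead=cold_start sizes it for demand that many steps ahead
    (predictive, with an assumed perfect forecast).
    """
    ready = 1
    pending = {}  # step at which ordered replicas become ready -> count
    dropped = 0
    for step, load in enumerate(demand):
        ready += pending.pop(step, 0)
        dropped += max(0, load - ready * per_replica)
        target_step = min(step + lookahead, len(demand) - 1)
        needed = math.ceil(demand[target_step] / per_replica)
        in_flight = ready + sum(pending.values())
        if needed > in_flight:
            arrival = step + cold_start
            pending[arrival] = pending.get(arrival, 0) + needed - in_flight
    return dropped


demand = [10, 10, 10, 10, 50, 50, 50, 50, 10, 10]
reactive_drops = simulate_scaling(demand, per_replica=10, cold_start=3, lookahead=0)
predictive_drops = simulate_scaling(demand, per_replica=10, cold_start=3, lookahead=3)
print(reactive_drops, predictive_drops)
```

The reactive policy drops requests for the whole cold-start window after the spike begins, while the predictive policy has its replicas warm by the time the spike arrives.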
Solution
Computing the percentiles from a timing stream and plotting them alongside the mean shows a low average masking a high p99 (Exercise 6) — making concrete why SLAs are written in percentiles, not averages.
Solution
Routing follow-up turns to the replica that already cached the conversation's prefix avoids re-prefilling the history, measurably reducing TTFT versus routing blindly (Exercise 7) — the latency benefit of cache-aware routing.
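A minimal router for this experiment might look like the following. It assumes the cache is keyed by a conversation ID and that a replica keeps a conversation's prefix once it has served it; a real router would also need to handle cache eviction and overloaded owners. The class and method names are illustrative.

```python
class CacheAwareRouter:
    """Route follow-up turns to the replica that served the conversation before."""

    def __init__(self, num_replicas):
        self.owner = {}  # conversation_id -> replica index
        self.conversations = [0] * num_replicas  # conversations per replica

    def route(self, conversation_id):
        """Return (replica, cache_hit) for one turn of a conversation."""
        if conversation_id in self.owner:
            return self.owner[conversation_id], True
        # First turn: no cached prefix anywhere, so pick the emptiest replica.
        replica = min(
            range(len(self.conversations)),
            key=self.conversations.__getitem__,
        )
        self.owner[conversation_id] = replica
        self.conversations[replica] += 1
        return replica, False
```

Counting cache hits under this router versus a random or round-robin one gives the fraction of turns that skip re-prefilling the history, which is where the TTFT saving comes from.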
Solution
A classifier/heuristic directing easy queries to a small model and hard ones to a large model serves most traffic cheaply, cutting average cost substantially with little overall quality loss (Exercise 7) — demonstrating the cost lever of routing by difficulty.
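The cost effect can be illustrated with a deliberately crude heuristic. Everything specific here is an assumption for the sake of the sketch: the relative costs, the word-count threshold, and the marker phrases are made up, and a real system would more likely use a trained classifier.

```python
SMALL_COST = 1.0   # hypothetical relative cost per request on the small model
LARGE_COST = 20.0  # hypothetical relative cost per request on the large model

HARD_MARKERS = ("prove", "derive", "step by step")


def is_hard(prompt, max_easy_words=30):
    """Crude difficulty heuristic: long prompts or reasoning keywords."""
    text = prompt.lower()
    too_long = len(text.split()) > max_easy_words
    return too_long or any(marker in text for marker in HARD_MARKERS)


def route_model(prompt):
    return "large" if is_hard(prompt) else "small"


def average_cost(prompts):
    """Mean per-request cost when each prompt goes to its routed model."""
    costs = [
        LARGE_COST if route_model(p) == "large" else SMALL_COST
        for p in prompts
    ]
    return sum(costs) / len(costs)
```

For a workload where nine in ten prompts are easy, the average cost under these assumed prices is 2.9 units per request, against 20 if everything went to the large model.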
Solution
Splitting simulated users, generating metrics with a planted difference, and running a significance test (e.g. t-test) shows whether B's improvement is statistically real or noise — the methodology of Exercise 9 for deciding deployments on evidence, not vibes.
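The sketch below uses a permutation test rather than the t-test mentioned above, so that it needs only the standard library; both answer the same question of whether the observed gap could plausibly be noise. The metric values, group sizes, and function name are assumptions for illustration.

```python
import random


def permutation_p_value(a, b, num_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(b) / len(b) - sum(a) / len(a))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(num_permutations):
        # Shuffle group labels: under the null hypothesis they are arbitrary.
        rng.shuffle(pooled)
        fake_a = pooled[:len(a)]
        fake_b = pooled[len(a):]
        diff = abs(sum(fake_b) / len(fake_b) - sum(fake_a) / len(fake_a))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (num_permutations + 1)


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated per-user success scores with a planted improvement for B.
    group_a = [rng.gauss(0.50, 0.10) for _ in range(200)]
    group_b = [rng.gauss(0.60, 0.10) for _ in range(200)]
    print("p-value:", permutation_p_value(group_a, group_b))
```

Rerunning with both groups drawn from the same distribution is a useful sanity check: the p-value should then usually be large, and the occasional small value is the false-positive rate the significance threshold accepts.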
Solution
Simulating a struggling service, naive immediate retries amplify the overload (the storm of Exercise 10) while exponential backoff with jitter throttles and de-synchronizes retries, letting the service recover — a direct demonstration of the resilience pattern.
Solution
Aggregating token counts and cost by request type and replica surfaces the biggest cost driver (often long-output or large-model requests) — the visibility needed to target optimizations (e.g. model routing) where they save the most, the FinOps side of serving.
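The aggregation itself is a group-by over request logs. The sketch below assumes each log record is a dict with hypothetical field names (type, replica, model, prompt_tokens, output_tokens) and a single made-up price per thousand tokens per model; real pricing often differs between prompt and output tokens.

```python
from collections import defaultdict


def cost_by(records, key, price_per_1k_tokens):
    """Total cost grouped by records[key], largest first."""
    totals = defaultdict(float)
    for record in records:
        tokens = record["prompt_tokens"] + record["output_tokens"]
        price = price_per_1k_tokens[record["model"]]
        totals[record[key]] += tokens / 1000 * price
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical prices and log records, for illustration only.
PRICES = {"small": 0.5, "large": 10.0}
LOGS = [
    {"type": "chat", "replica": "r1", "model": "small",
     "prompt_tokens": 500, "output_tokens": 500},
    {"type": "summarize", "replica": "r2", "model": "large",
     "prompt_tokens": 3000, "output_tokens": 1000},
    {"type": "chat", "replica": "r2", "model": "large",
     "prompt_tokens": 1000, "output_tokens": 1000},
]

if __name__ == "__main__":
    print(cost_by(LOGS, "type", PRICES))
    print(cost_by(LOGS, "replica", PRICES))
```

Grouping the same records by different keys (request type, replica, model) is what reveals whether the dominant cost comes from a workload, a routing imbalance, or model choice.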
Solution
The capstone integrates the chapter: a streaming, rate-limited gateway over load-balanced replicas (each using Chapter-27 continuous batching), with cache/model routing, predictive autoscaling, p99-latency and cost monitoring, and a canary rollout. Driving it with a realistic workload shows latency, throughput, and cost responding to each mechanism; deliberately failing a replica and observing that failover keeps the SLA intact demonstrates fault tolerance. This is the full production-serving picture, and the close of Part VI.