Part VI: Productionization
Chapter 31

Serving at Scale

Batching, KV cache, speculative decoding, and inference
20 Exercises
31.1

Chapter 27 made a single model run fast on a single machine. But a real product serves thousands of requests per second, from users around the world, around the clock, reliably, within a budget. That is no longer a model problem — it is a SYSTEMS problem. This final chapter of Part VI is about the engineering that turns an optimized model into a dependable, scalable SERVICE.

What Changes at Scale

A demo serving one request at a time hides nearly everything that matters in production. At scale, you must handle many concurrent users, spread load across many expensive GPUs, survive hardware failures without downtime, meet latency promises even under load spikes, roll out new models without breaking anything, and keep the enormous compute bill under control. None of this is about the model itself; all of it determines whether the product works.

Concern | One request (Ch. 27) | At scale (this chapter)
Concurrency | One at a time | Thousands per second
Hardware | One GPU | Fleets of GPUs across regions
Failure | Restart and retry | Survive failures, no downtime
Latency | Best effort | Guaranteed by an SLA
Updates | Reload the model | Safe rollouts, versioning
Cost | Run it | Optimize across the fleet
Scale Note: The Model Is the Easy Part
A recurring theme of Part VI reaches its peak here: the trained model is the EASY part of a production system. The hard part is everything around it — the API, the load balancer, the autoscaler, the monitoring, the failover, the cost controls, the deployment pipeline. A brilliant model behind a fragile, slow, or expensive serving stack is a failed product; a good model behind excellent infrastructure is a great one.
This chapter is the least about machine learning and the most about systems engineering — and that is precisely the point. Deploying AI at scale is a software-and-infrastructure discipline. The ML got you a capable model; the systems engineering is what makes it a service people can depend on.

The Three Forces: Latency, Throughput, Cost

Everything in production serving balances three forces, building on Chapter 27's metrics: LATENCY (responses must be fast enough for users), THROUGHPUT (the system must handle the total load), and COST (it must fit a budget). These pull against each other — lower latency often means more cost, higher throughput can hurt latency — and the art of serving at scale is balancing all three under real-world conditions.

31.2

The API is how everyone — your own apps, external developers, other services — talks to the model. A well-designed API is the foundation of a usable service; a poorly-designed one creates friction forever. Let us cover the essentials of designing an LLM API.

Core API Design Choices

Decision | Options / considerations
Streaming | Stream tokens as generated (low perceived latency) vs return all at once
Sync vs async | Immediate response vs submit-and-poll for long jobs
Stateless vs stateful | Send full context each time vs server keeps conversation state
Batching interface | Single requests vs a batch endpoint for bulk jobs
Parameters | Expose temperature, max tokens, stop sequences, etc.
Error format | Clear, consistent, actionable error responses
Versioning | Version the API so changes don't break clients

Streaming: The Most Important Choice

For interactive use, STREAMING is essential. Instead of waiting for the entire response (which could be many seconds for a long answer), the API streams tokens to the user AS THEY ARE GENERATED. The user sees text appearing immediately, dramatically improving perceived responsiveness — the response starts at the TTFT (Chapter 27), not after the full generation. Almost every interactive LLM product streams.

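Python: A streaming LLM API endpoint (sketch)
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    """Request body; the default values are illustrative."""
    prompt: str
    max_tokens: int = 256
    temperature: float = 1.0

# `model` is assumed to be a serving-engine wrapper created at startup that
# exposes an async generator `generate_stream(...)`; it is not defined here.

@app.post('/v1/generate')
async def generate(request: GenerateRequest):
    """Stream tokens to the client as they are generated."""
    async def token_stream():
        async for token in model.generate_stream(
            request.prompt, max_tokens=request.max_tokens,
            temperature=request.temperature):
            # JSON-encode so a newline inside a token cannot break SSE framing.
            payload = json.dumps({'token': token})
            yield f'data: {payload}\n\n'   # server-sent events

    return StreamingResponse(token_stream(), media_type='text/event-stream')

# The user sees tokens appear immediately (at TTFT), not after the
# whole response is done. This is why chat UIs feel responsive.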
Scale Note: Stateless Servers Scale Better
A key design principle: make the serving layer STATELESS where possible. If each request carries its full context (the whole conversation), then ANY server can handle ANY request — you can freely load-balance across the fleet and add or remove servers at will. If servers hold conversation state, requests must be 'sticky' to a particular server, which complicates load balancing and failure recovery. Statelessness is what makes horizontal scaling clean.
The trade-off is that stateless servers re-process the conversation context each turn (more compute), which prefix caching (Chapter 27) mitigates by reusing the cached prefix. The common pattern: stateless servers plus a shared cache — the best of both, scalable and efficient.
31.3

One GPU cannot serve millions of users. Production systems run many REPLICAS — copies of the model on many GPUs — with a LOAD BALANCER spreading requests across them and an AUTOSCALER adding or removing replicas as demand changes. This is the backbone of scalable serving.

Load Balancing: Spreading the Work

A load balancer sits in front of the replicas and routes each incoming request to one of them. The goal is to keep all replicas evenly busy — no replica overloaded while others sit idle. Naive strategies like round-robin (rotate through replicas) work poorly for LLMs because requests vary wildly in cost (a 10-token reply vs a 2,000-token reply). LLM-aware load balancing routes based on actual LOAD — current queue depth, KV-cache usage, or number of active requests per replica.

Device Grid: Load balancer spreading requests across GPU replicas

Device | GPU 0 | GPU 1 | GPU 2 | GPU 3
Replicas | model | model | model | model
Load | 60% | 55% | 62% | 58%
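The load-aware routing described above can be sketched in a few lines. This is a minimal illustration, not a production balancer: the `Replica` fields and the use of an estimated token count as the load signal are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one model replica as seen by the balancer."""
    name: str
    active_requests: int = 0   # requests currently being generated
    queued_tokens: int = 0     # rough estimate of outstanding work


def pick_least_loaded(replicas: list[Replica]) -> Replica:
    """Return the replica with the least outstanding work.

    Ties on queued tokens are broken by the number of active requests.
    """
    return min(replicas, key=lambda r: (r.queued_tokens, r.active_requests))


def dispatch(replicas: list[Replica], est_tokens: int) -> Replica:
    """Route one request and record its estimated cost on the chosen replica."""
    target = pick_least_loaded(replicas)
    target.active_requests += 1
    target.queued_tokens += est_tokens
    return target


fleet = [Replica("gpu0"), Replica("gpu1")]
first = dispatch(fleet, est_tokens=2000)   # long request lands on gpu0
second = dispatch(fleet, est_tokens=10)    # short request goes to idle gpu1
third = dispatch(fleet, est_tokens=10)     # gpu1 again: it still has far less work
```

Round-robin would have sent the third request to gpu0, behind the 2,000-token generation. Routing on estimated work keeps short requests away from the busy replica.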

Autoscaling: Matching Capacity to Demand

Demand is not constant — it spikes during the day, drops at night, surges with viral events. AUTOSCALING adds replicas when demand rises (to maintain latency) and removes them when demand falls (to save cost). The challenge for LLMs: spinning up a new replica is SLOW — loading a large model onto a GPU can take minutes — so autoscaling must be PREDICTIVE (scale up before the spike, based on trends) rather than purely reactive, or users hit slow responses while new capacity warms up.

Strategy | How it decides
Reactive scaling | Add replicas when load/latency crosses a threshold (can lag)
Predictive scaling | Forecast demand and scale ahead of it
Scheduled scaling | Scale on known patterns (e.g. up at 9am, down at night)
Queue-based | Scale to keep request queue depth bounded
Scale Note: Cold Starts Are the Autoscaling Challenge
The biggest practical headache in LLM autoscaling is the COLD START: a new replica must load gigabytes of model weights onto a GPU before it can serve, which can take minutes for a large model. By the time a reactive autoscaler notices high load and starts a replica, users have already suffered slow responses. Mitigations include keeping warm spare replicas, predictive scaling, faster model loading, and over-provisioning slightly to absorb spikes.
This is a key difference from traditional web services, where a new server starts in seconds. The sheer size of LLM weights makes capacity slow to add, which is why LLM serving leans heavily on prediction and warm pools rather than purely reactive scaling. Plan capacity for the spike, not the average.
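To make the reactive-versus-predictive difference concrete, here is a small sizing sketch. The linear forecast, the 20% headroom, and all the traffic numbers are illustrative assumptions; a real autoscaler would use a better forecast and measured per-replica capacity.

```python
import math


def desired_replicas(forecast_rps: float, per_replica_rps: float,
                     headroom: float = 0.2, min_replicas: int = 1) -> int:
    """Replicas needed to serve a load with some spare headroom."""
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = forecast_rps * (1.0 + headroom) / per_replica_rps
    return max(min_replicas, math.ceil(needed))


def forecast_at_ready_time(current_rps: float, growth_rps_per_min: float,
                           cold_start_min: float) -> float:
    """Linearly extrapolate load to the moment a new replica would be ready."""
    return max(0.0, current_rps + growth_rps_per_min * cold_start_min)


now_rps = 400.0
# Reactive: size the fleet for the load observed right now.
reactive = desired_replicas(now_rps, per_replica_rps=50.0)
# Predictive: size it for the load expected once the cold start has finished.
predicted_rps = forecast_at_ready_time(now_rps, growth_rps_per_min=40.0,
                                       cold_start_min=5.0)
predictive = desired_replicas(predicted_rps, per_replica_rps=50.0)
```

With these numbers the reactive target is 10 replicas and the predictive target is 15. The gap is the capacity that would be missing while new replicas warm up.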
31.4

A production service makes PROMISES about its performance — a Service Level Agreement (SLA) — and must measure whether it keeps them. Defining and meeting SLAs is central to running a dependable service, and it requires the right metrics.

What an SLA Specifies

An SLA defines the level of service users can expect, typically covering: AVAILABILITY (uptime, e.g. 99.9%), LATENCY (e.g. 'p95 time-to-first-token under 500ms'), and sometimes throughput or error rates. The service is engineered, monitored, and over-provisioned to meet these targets. Falling short has consequences — unhappy users, broken contracts, financial penalties.

SLA (Service Level Agreement)
A commitment to a measurable level of service — such as 99.9% availability and p95 latency under a threshold — that the service is engineered and monitored to meet.
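An availability target translates directly into a downtime budget. The helper below is a small worked calculation; the function name is made up for this example.

```python
def downtime_budget_hours(availability: float,
                          period_hours: float = 365 * 24) -> float:
    """Hours of allowed downtime in a period for a given availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


three_nines = downtime_budget_hours(0.999)    # about 8.76 hours per year
four_nines = downtime_budget_hours(0.9999)    # about 0.88 hours per year
```

Each extra "nine" cuts the budget by a factor of ten, which is why higher availability targets get sharply more expensive to meet.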

Percentiles, Not Averages

A crucial lesson: measure latency with PERCENTILES, not averages. The average hides the bad cases — if 99% of requests are fast but 1% take 30 seconds, the average looks fine while some users have a terrible experience. The p95 (95th percentile) and p99 (99th percentile) latencies capture the TAIL — how bad the slow requests are. SLAs are written in percentiles ('p99 latency < 2s') because they reflect the worst experiences real users actually have.

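Text: Why percentiles beat averages
100 requests: 98 take 0.5s, 2 take 30s
    average  = (98×0.5 + 2×30) / 100 = 1.09s   <- looks tolerable
    p50      = 0.5s                            <- the typical request is fine
    p99      = 30s                             <- the truth: the tail is awful

# With only 1 slow request in 100, a nearest-rank p99 would still be 0.5s;
# that outlier would only show up at p99.9 or the maximum.
# SLAs use p95/p99 because they capture the TAIL users actually feel.
# Tail latency, not average, determines user experience at scale.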
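The same comparison can be computed directly. This sketch uses the nearest-rank percentile definition, one of several in common use, on a made-up sample of 98 fast requests and 2 slow ones.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


latencies = [0.5] * 98 + [30.0] * 2   # seconds; illustrative data
avg = mean(latencies)                  # 1.09
p50 = percentile(latencies, 50)        # 0.5
p99 = percentile(latencies, 99)        # 30.0
```

The mean and the median both look acceptable here. Only the p99 reveals that some users waited 30 seconds.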
Metric | Meaning
Availability | Fraction of time the service is up (99.9% ≈ 9 hr/yr down)
p50 / p95 / p99 latency | Median and tail latency (the slow-case experience)
Error rate | Fraction of requests that fail
Throughput | Requests or tokens served per second
Saturation | How close the fleet is to capacity
Scale Note: Tail Latency Is the Enemy
At scale, the TAIL (p99, p99.9) is what hurts. With millions of requests, even a tiny fraction of slow ones is a large number of frustrated users, and a single user's session may involve many requests, so the chance of hitting at least one slow one compounds. Much of serving engineering is about taming the tail: avoiding stragglers, handling load spikes, and ensuring no request waits too long behind others.
This is why LLM-aware scheduling and load balancing matter so much: a long request can block short ones (head-of-line blocking), spiking tail latency. Continuous batching (Chapter 27), fair scheduling, and routing all exist partly to keep the tail under control.
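The compounding effect is easy to quantify under a simplifying assumption: each request is independently slow with the same probability. Real traffic is correlated, so treat this as a rough estimate.

```python
def prob_at_least_one_slow(p_slow: float, n_requests: int) -> float:
    """Chance that a session of n requests hits at least one slow request,
    assuming each request is independently slow with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** n_requests


session_20 = prob_at_least_one_slow(0.01, 20)     # roughly 0.18
session_100 = prob_at_least_one_slow(0.01, 100)   # roughly 0.63
```

If 1% of requests are slow, about 18% of 20-request sessions and about 63% of 100-request sessions see at least one slow response.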
31.5

Beyond simple load balancing, a sophisticated serving system ROUTES requests intelligently and SCHEDULES them to balance competing goals. With diverse requests (short and long, cheap and expensive, free and paid) and a heterogeneous fleet (different GPUs, different models), good routing and scheduling are key to both performance and cost.

Request Routing

Routing decides WHICH replica or model handles a request. Smart routing considers: which replica is least loaded, which already has the relevant prefix cached (route a follow-up to the same replica — 'cache-aware routing'), which model size suits the request (route easy queries to a small fast model, hard ones to a large model — 'model routing'), and the request's priority tier. Good routing can dramatically improve both latency and cost.

Tool Trace: Intelligent request routing

User | Sends a follow-up message in an ongoing chat
Router | Sees the conversation prefix is cached on Replica 2
Router | Routes to Replica 2 to reuse its prefix cache (fast TTFT)
Replica 2 | Reuses cached prefix, generates only the new turn
User | Gets a fast response, with no re-processing of the whole history
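The trace above can be sketched as a small router. This is an illustrative design only: the `CacheAwareRouter` class, the conversation-ID lookup table, and the `max_load` cutoff are assumptions for the example, and a real router would also need to handle cache eviction and replica failure.

```python
class CacheAwareRouter:
    """Prefer the replica that already holds a conversation's prefix cache."""

    def __init__(self, replica_names: list[str], max_load: int = 8) -> None:
        self.load = {name: 0 for name in replica_names}   # active requests
        self.prefix_owner: dict[str, str] = {}            # conversation -> replica
        self.max_load = max_load                          # skip overloaded owners

    def route(self, conversation_id: str) -> str:
        owner = self.prefix_owner.get(conversation_id)
        if owner is not None and self.load[owner] < self.max_load:
            target = owner                                # reuse the cached prefix
        else:
            target = min(self.load, key=self.load.get)    # least-loaded fallback
            self.prefix_owner[conversation_id] = target   # prefix will live here now
        self.load[target] += 1
        return target

    def finish(self, replica: str) -> None:
        """Call when a request completes to release its slot."""
        self.load[replica] = max(0, self.load[replica] - 1)


router = CacheAwareRouter(["replica-0", "replica-1", "replica-2"])
first_turn = router.route("chat-42")   # no cache yet: least-loaded replica
router.route("chat-7")                 # another conversation, another replica
follow_up = router.route("chat-42")    # follow-up returns to the same replica
```

The `max_load` check is the key design choice. Cache affinity is only a preference, so an overloaded replica does not keep attracting its old conversations.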

Scheduling: Who Goes First

Within a replica, the SCHEDULER decides the order requests are served and how they share the GPU (building on continuous batching from Chapter 27). It balances fairness (no request starves), priority (paid or interactive requests first), and efficiency (keep the batch full). A key concern is preventing a few huge requests from monopolizing the GPU and starving everyone else — so schedulers may preempt, chunk, or cap long generations to protect the tail latency of short ones.
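A minimal sketch of these ideas follows: priority tiers, first-in-first-out order within a tier, and a cap on generation length. The class and field names are hypothetical. Strict priority as shown can still starve low tiers under sustained load, so a real scheduler would add aging or quotas.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int                        # lower number is served first
    seq: int                             # arrival order; FIFO within a tier
    request_id: str = field(compare=False)
    max_new_tokens: int = field(compare=False)


class PriorityScheduler:
    def __init__(self, token_cap: int = 1024) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()
        self.token_cap = token_cap       # cap long generations to protect the tail

    def submit(self, request_id: str, priority: int, max_new_tokens: int) -> None:
        capped = min(max_new_tokens, self.token_cap)
        item = QueuedRequest(priority, next(self._counter), request_id, capped)
        heapq.heappush(self._heap, item)

    def next_batch(self, batch_size: int) -> list[QueuedRequest]:
        """Pop up to batch_size requests in priority order."""
        batch = []
        while self._heap and len(batch) < batch_size:
            batch.append(heapq.heappop(self._heap))
        return batch


sched = PriorityScheduler(token_cap=512)
sched.submit("free-long", priority=2, max_new_tokens=4000)
sched.submit("paid-chat", priority=0, max_new_tokens=200)
sched.submit("free-short", priority=2, max_new_tokens=50)
batch = sched.next_batch(batch_size=2)
```

The paid request jumps the queue, and the long free request is admitted next with its generation capped at 512 tokens.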

Routing/scheduling idea | Benefit
Cache-aware routing | Route to the replica with the prefix cached → fast TTFT
Model routing | Easy queries to small models, hard to large → save cost
Priority tiers | Interactive/paid requests served first
Fair scheduling | No request starves behind big ones
Preemption / chunking | Long requests don't block short ones
Geographic routing | Serve from the nearest region → lower latency
Scale Note: Model Routing Saves Real Money
A powerful cost lever: not every query needs your biggest, most expensive model. MODEL ROUTING sends easy queries (simple questions, short completions) to a small fast cheap model, and reserves the large expensive model for genuinely hard queries. A classifier or the small model itself can decide. Since most queries are easy, routing the bulk of traffic to a cheaper model can cut costs dramatically while preserving quality where it matters.
This echoes the adaptive-compute idea from reasoning (Chapter 25): spend expensive capability only where it is needed. At the fleet level, model routing is one of the highest-impact cost optimizations — matching each request to the cheapest model that can handle it well.
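As a rough illustration, the sketch below routes with a crude keyword-and-length heuristic and computes the blended cost per query. The hint list, the word threshold, the 80% easy fraction, and the 10x cost ratio are all made-up assumptions. In practice the decision would come from a trained classifier or from the small model itself.

```python
HARD_HINTS = ("prove", "derive", "debug", "step by step", "analyze")


def choose_model(prompt: str, max_words_for_small: int = 60) -> str:
    """Return 'small' or 'large' using a crude difficulty heuristic."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS)
    is_long = len(text.split()) > max_words_for_small
    return "large" if (looks_hard or is_long) else "small"


def blended_cost(easy_fraction: float, small_cost: float,
                 large_cost: float) -> float:
    """Average cost per query when easy queries go to the small model."""
    return easy_fraction * small_cost + (1.0 - easy_fraction) * large_cost


easy = choose_model("What is the capital of France?")
hard = choose_model("Derive the gradient of softmax cross-entropy step by step.")
# If 80% of traffic is easy and the large model costs 10x the small one:
cost = blended_cost(easy_fraction=0.8, small_cost=1.0, large_cost=10.0)
```

Under these assumed numbers the blended cost is 2.8 units per query against 10 for sending everything to the large model. The saving only holds if the router rarely sends a hard query to the small model.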
31.6

Models are updated regularly — improved versions, fine-tunes, fixes. But a new model can behave differently, and a bad update can degrade quality or break downstream systems for millions of users at once. MODEL VERSIONING and SAFE ROLLOUT practices let you change models without risking the whole product.

Why Versioning Matters

Users and downstream systems may depend on a model's specific behaviour. Silently swapping in a new model can break prompts that were tuned for the old one, change output formats that code parses, or shift quality in unexpected ways. VERSIONING means each model version is named and addressable, old versions remain available for a transition period, and clients can pin to a specific version. This lets the system evolve without yanking the ground out from under anyone.

Safe Rollout Strategies

Strategy | How it works
Canary release | Send a small % of traffic to the new model; watch metrics; expand if good
Blue-green | Run old (blue) and new (green) in parallel; switch over instantly; roll back fast
Shadow / mirror | Send traffic to the new model WITHOUT using its output, to compare safely
Gradual rollout | Ramp traffic to the new model slowly (1% → 10% → 50% → 100%)
Feature flags | Toggle the new model on/off per user or cohort instantly

Pipeline Flow: A safe model rollout

1 | Shadow test | Run the new model on real traffic without serving its output; compare
2 | Canary | Serve the new model to 1% of users; watch quality, latency, errors
3 | Ramp | Gradually increase: 1% → 10% → 50% as metrics stay healthy
4 | Full + monitor | Roll out to 100%, keep the old version ready to roll back
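One common way to implement the canary and ramp steps is to hash each user ID into a stable bucket, so the same user keeps seeing the same model as the percentage grows. The sketch below assumes that approach; the salt string and function names are invented for the example.

```python
import hashlib


def bucket(user_id: str, salt: str = "rollout-v2") -> int:
    """Map a user to a stable bucket in [0, 100) using a hash of the ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def model_for(user_id: str, canary_percent: int) -> str:
    """Serve the new model to users whose bucket is below the canary percent."""
    return "new" if bucket(user_id) < canary_percent else "old"


users = [f"user-{i}" for i in range(1000)]
on_new_at_10 = {u for u in users if model_for(u, 10) == "new"}
on_new_at_50 = {u for u in users if model_for(u, 50) == "new"}
```

Because buckets are fixed, raising the percentage only ever adds users to the new model, and setting it back to 0 is an immediate rollback for everyone.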
Scale Note: Always Be Able to Roll Back
The golden rule of safe rollouts: NEVER deploy a change you cannot quickly undo. Keep the previous model version warm and ready, and have a one-click (or automatic) rollback if metrics degrade. Even with careful shadow testing and canaries, a new model can surprise you in production at full scale. The ability to roll back in seconds turns a potential disaster into a minor blip.
This is standard practice from software deployment, applied to models: gradual rollout, continuous monitoring, instant rollback. A model is just another component being deployed — treat its rollout with the same discipline as any production change, because at scale the blast radius of a bad model is enormous.
31.7

How do you know a new model or prompt is actually BETTER, not just different? Offline benchmarks (Chapter 21) help, but the real test is how it performs with REAL users on REAL traffic. A/B TESTING and production evaluation answer this rigorously, building on the evaluation discipline from Part IV.

A/B Testing

In an A/B test, you serve the OLD model (A) to one randomly-chosen group of users and the NEW model (B) to another, then compare outcomes: user satisfaction, task completion, engagement, thumbs-up rates, retention. Because users are randomly assigned, a statistically significant difference in outcomes can be attributed to the model change rather than to differences between the groups. This is the gold standard for deciding whether a change genuinely improves the product, not just the benchmark.

Tool Trace: An A/B test in production

User pool | Randomly split into group A and group B
Group A | Served the current model (control)
Group B | Served the new model (treatment)
Metrics | Compare satisfaction, completion, retention between A and B
Decision | If B is significantly better, roll it out; else, don't
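For a binary outcome such as thumbs-up rate, "significantly better" can be checked with a two-proportion z-test. The sketch below is one standard choice, not the only one, and the counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical result: 40.0% thumbs-up for A vs 42.0% for B, 10,000 users each.
z, p = two_proportion_z_test(success_a=4000, n_a=10000,
                             success_b=4200, n_b=10000)
```

Here z is about 2.9 and the p-value is below 0.01, so a two-point lift on samples this large is unlikely to be noise. The same lift measured on a few hundred users would not pass.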

What to Measure

Production evaluation goes beyond benchmark scores to real outcomes: explicit feedback (thumbs up/down, ratings), implicit signals (did the user retry, rephrase, or abandon?), task completion, latency, and cost. The key discipline — echoing the 'distrust the proxy' lesson of Chapter 23 — is to measure what actually MATTERS to users, not just what is easy to measure. A model that scores higher on a benchmark but frustrates real users is not an improvement.

Scale Note: Online and Offline Evaluation Complement
Offline benchmarks (Chapter 21) are fast, cheap, and reproducible but may not reflect real use. Online A/B tests measure real-user impact but are slower, costlier, and noisier. The two complement each other: use offline evals to catch regressions and screen candidates quickly, then A/B test the promising ones to confirm real-world benefit before full rollout. Neither alone is sufficient.
And both connect to the deployment safety of Section 31.6: shadow testing, canaries, and A/B tests form a graduated evaluation pipeline — each stage exposes the new model to more real traffic with more confidence, so problems surface before they reach everyone. Evaluation is not a one-time gate but a continuous part of safe operation.
31.8

At scale, failures are not exceptional — they are constant. GPUs fail, networks hiccup, replicas crash, dependencies time out, traffic spikes. A reliable service is designed to keep working DESPITE failures, not to assume they won't happen. Fault tolerance is what separates a service that has occasional bad days from one users can depend on.

Designing for Failure

Technique | What it protects against
Redundancy | Many replicas, so one failing doesn't take down the service
Health checks | Detect and remove unhealthy replicas automatically
Failover | Reroute requests from a failed replica to healthy ones
Retries (with backoff) | Transient failures: retry, but not in a way that amplifies load
Timeouts | Don't let one stuck request hang forever
Circuit breakers | Stop hammering a failing dependency; fail fast
Graceful degradation | Fall back to a smaller model / cached / simpler response
Rate limiting | Protect the system from overload and abuse

Graceful Degradation

A crucial reliability idea: when the system is overloaded or partially failing, DEGRADE GRACEFULLY rather than collapse. Under extreme load, it is better to serve everyone a slightly-worse response (a smaller faster model, a cached answer, a shorter generation) than to serve some users perfectly while others get errors or time out. A service that bends under pressure beats one that breaks.
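A degradation policy can be as simple as a ladder of fallbacks keyed on how saturated the fleet is. The thresholds and mode names below are illustrative assumptions, not recommended values.

```python
from typing import Optional


def choose_response_mode(saturation: float,
                         cached_answer: Optional[str] = None) -> str:
    """Pick the cheapest acceptable way to answer as saturation rises.

    saturation is the fraction of fleet capacity currently in use.
    """
    if cached_answer is not None:
        return "cached"              # cheapest path: no generation at all
    if saturation < 0.80:
        return "large-model"         # normal operation
    if saturation < 0.95:
        return "small-model"         # slightly worse answers, still served
    return "small-model-short"       # extreme load: small model, capped length


modes = [choose_response_mode(s) for s in (0.5, 0.9, 0.99)]
```

Every branch returns some response. The policy never chooses to reject, which is the "bend, don't break" idea in code form.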

Warning: Retries Can Make Things Worse
A subtle but important danger: naive retries can turn a small problem into a catastrophe. If a service slows down and every client immediately retries, the retries MULTIPLY the load, pushing the struggling service further over the edge — a 'retry storm' that can cause a full outage. Always use retries with EXPONENTIAL BACKOFF (wait longer between each retry) and JITTER (randomize the timing), and cap the number of retries, so recovery is helped, not hindered.
This is a classic distributed-systems lesson that applies fully to LLM serving: the mechanisms meant to improve reliability (retries) can destroy it if implemented carelessly. Reliability engineering is full of such counterintuitive traps, which is why it is its own discipline — and why serving at scale is far more than just running the model.
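Capped exponential backoff with jitter takes only a few lines. This sketch uses the "full jitter" variant, where each wait is drawn uniformly between zero and the exponential ceiling. The helper names and default values are assumptions, and real code should retry only errors known to be transient.

```python
import random
import time


def backoff_ceiling(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound in seconds on the wait before retry number `attempt` (0-based)."""
    return min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_retries: int = 4, base: float = 0.5,
                      cap: float = 30.0, sleep=time.sleep, rng=random):
    """Call `operation`, retrying on exception with full-jitter backoff.

    `sleep` and `rng` are injectable so the behaviour can be tested
    without actually waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise                  # retry budget exhausted: surface the error
            # Random wait in [0, ceiling] spreads clients out in time.
            sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))


ceilings = [backoff_ceiling(k) for k in range(8)]
```

The ceilings double from 0.5s up to the 30s cap, and the randomization stops thousands of clients from retrying at the same instant. The retry cap bounds the extra load any one client can add.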
31.9

LLM serving is expensive — GPUs cost a lot, and at scale the bill is enormous. Cost optimization is not an afterthought; it can be the difference between a viable product and an unprofitable one. Fortunately, many levers reduce cost, layering on top of the per-request optimizations of Chapter 27.

The Levers of Cost

Lever | How it cuts cost
Quantization | Smaller models → cheaper GPUs, more throughput (Ch. 27)
Continuous batching | More requests per GPU → fewer GPUs needed (Ch. 27)
Model routing | Cheap model for easy queries, big model only when needed
Caching | Reuse results for repeated/similar queries; prefix caching
Right-sizing | Match GPU type to the workload; don't over-provision
Autoscaling | Don't pay for idle capacity at off-peak times
Spot / preemptible | Use cheaper interruptible instances for batch work
Distillation | Serve a smaller distilled model where quality allows

The Cost-per-Token Mindset

The fundamental unit of LLM serving cost is COST PER TOKEN (or per request). Every optimization ultimately aims to lower it: quantization and batching lower the GPU cost of producing each token; routing and caching avoid producing tokens with an expensive model when a cheaper path suffices. Tracking cost per token across the fleet, and per feature, reveals where the money goes and where optimization pays off most.

Text: Cost per token (the key unit)
cost_per_token ≈ (GPU $/hour) / (tokens/hour per GPU)

Lower it by:
  ↑ tokens/hour: batching, quantization, better kernels (Ch. 27)
  ↓ GPU $/hour: right-sizing, spot instances, cheaper hardware
  avoid tokens: caching, routing easy queries to cheap models
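The formula above is easy to turn into a worked calculation. The GPU price and throughput figures below are placeholders chosen for round numbers, not benchmarks.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per million generated tokens for one GPU at steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Hypothetical GPU at $2.00/hour.
baseline = cost_per_million_tokens(2.0, tokens_per_second=1000)   # about $0.56
batched = cost_per_million_tokens(2.0, tokens_per_second=4000)    # about $0.14
```

Quadrupling throughput on the same hardware cuts the cost per token by four, which is why batching and quantization appear first among the levers.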
Scale Note: Caching Is Free Money
Caching deserves special emphasis as a cost lever. Many real workloads have repeated or near-repeated queries (the same question, the same system prompt, common requests). A response-cache hit (exact or semantic) means NOT running the model at all: the cheapest possible token is one you never generate. A prefix KV-cache hit (Chapter 27) still runs the model but skips recomputing the shared prefix. For workloads with repetition, caching can cut costs substantially; exact-match hits do so with no quality loss, while semantic matching trades a small risk of returning a mismatched answer for a higher hit rate.
Combine the levers and the savings multiply: quantization + batching make each token cheaper, routing + caching avoid expensive or redundant tokens, and autoscaling avoids paying for idle GPUs. Cost optimization at scale is about stacking many such savings, each modest, into a large total reduction.
31.10

A production service must be OBSERVABLE — you must be able to see what it is doing, detect problems quickly, and diagnose them. Monitoring and observability are what let you operate a service reliably, catch issues before users do, and understand what is happening across a large fleet.

What to Monitor

Category | What to watch
Performance | Latency (p50/p95/p99), TTFT, TPOT, throughput
Reliability | Error rates, availability, failed/timed-out requests
Capacity | GPU utilization, queue depth, KV-cache usage, saturation
Cost | Tokens served, cost per token, spend by feature/customer
Quality | Feedback signals, refusal rates, output anomalies
Traffic | Request volume, patterns, geographic distribution

Alerting and Dashboards

Monitoring data feeds two things: DASHBOARDS (live views of the system's health for humans to inspect) and ALERTS (automatic notifications when something crosses a threshold — latency spiking, errors rising, a replica down). Good alerting catches problems early, ideally before users notice; good dashboards let engineers diagnose and resolve them quickly. The aim is to know about and fix issues faster than users experience them.

Monitoring Quality, Not Just Systems

Beyond system metrics (latency, errors), production LLM serving must monitor OUTPUT QUALITY — which is harder. Watch for spikes in refusals (the model suddenly declining too much), output anomalies, drops in user feedback, and shifts in behaviour after a deployment. Quality regressions can be subtle and invisible to system metrics: the service is 'up' and fast, but the model's answers got worse. Monitoring quality signals — not just whether requests succeed — is essential and distinctive to ML serving.

Scale Note: Watch Quality, Not Just Uptime
Traditional service monitoring asks 'is it up and fast?'. LLM serving must also ask 'are the answers still good?'. A model can be perfectly available and fast while silently producing worse outputs — after a bad deployment, a data shift, or a prompt change. Because quality is hard to measure automatically, teams combine proxy signals (feedback rates, refusal rates, output-length distributions, sample audits) to detect quality regressions that system metrics miss.
This is the observability frontier unique to AI products: monitoring not just the SYSTEM but the MODEL'S BEHAVIOUR. It ties back to evaluation (Section 31.7 and Chapter 21) — production quality monitoring is continuous evaluation on live traffic, the last line of defense against shipping a regression to everyone.
31.11

A public LLM service faces security, privacy, and abuse challenges beyond performance and reliability. Serving at scale means defending against misuse, protecting user data, and enforcing usage policies — responsibilities that grow with the service's reach.

Concern | Defense
Abuse / misuse | Safety filtering (Ch. 26), usage policies, monitoring, bans
Prompt injection | Treat inputs as untrusted; sandbox tools (Ch. 28)
Data privacy | Encrypt data; minimize retention; honor deletion; isolate tenants
DoS / overload | Rate limiting, quotas, authentication
Data leakage | Prevent one user's data leaking to another; careful caching
Cost attacks | Quotas and limits so abuse can't run up huge bills

Privacy and Multi-tenancy

When a service handles many users' or organizations' data ('multi-tenant'), strict ISOLATION is essential: one tenant's data, cache entries, and context must never leak to another. This shapes caching (don't share caches across tenants carelessly), logging (be careful what you store), and data handling (encryption, retention limits, deletion on request). Privacy is both an ethical obligation and, increasingly, a legal requirement.
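One simple way to keep a response cache from leaking across tenants is to make the tenant ID part of every cache key. The sketch below assumes an in-memory dictionary as the store, and the class and function names are hypothetical.

```python
import hashlib
from typing import Optional


def cache_key(tenant_id: str, model_version: str, prompt: str) -> str:
    """Build a cache key scoped to one tenant and one model version."""
    h = hashlib.sha256()
    for part in (tenant_id, model_version, prompt):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))   # length prefix: unambiguous joins
        h.update(data)
    return h.hexdigest()


class TenantScopedCache:
    """Response cache in which a hit requires the same tenant, model, and prompt."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def put(self, tenant_id: str, model_version: str,
            prompt: str, response: str) -> None:
        self._store[cache_key(tenant_id, model_version, prompt)] = response

    def get(self, tenant_id: str, model_version: str,
            prompt: str) -> Optional[str]:
        return self._store.get(cache_key(tenant_id, model_version, prompt))


cache = TenantScopedCache()
cache.put("acme", "v1", "Summarize our Q3 report", "Q3 summary for Acme")
same_tenant = cache.get("acme", "v1", "Summarize our Q3 report")     # hit
other_tenant = cache.get("globex", "v1", "Summarize our Q3 report")  # miss
```

Including the model version in the key also stops a rollout from serving answers that an older model produced.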

Rate Limiting and Quotas

Rate limiting protects the service from overload and abuse, and controls cost. By capping how many requests or tokens a user can consume per time window, the service prevents any single user (malicious or buggy) from overwhelming capacity or running up unbounded cost. Tiered quotas (more for paying customers) also implement the business model. Rate limiting is a basic but essential layer of any public LLM API.
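A token bucket is one widely used way to implement such limits: it allows short bursts while capping the sustained rate. In this sketch the caller passes in the current time so the behaviour is deterministic; a real service would typically read a monotonic clock and keep one bucket per user or API key.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 now: float = 0.0) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity       # start full
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True and spend `cost` if the request fits in the budget."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


limiter = TokenBucket(rate_per_sec=1.0, capacity=3.0)
burst = [limiter.allow(now=0.0) for _ in range(4)]   # three pass, fourth is refused
later = limiter.allow(now=2.0)                       # refilled after two seconds
```

Setting `cost` to the number of tokens requested, rather than 1 per request, turns the same structure into a token-based quota.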

Scale Note: Safety Is Part of Serving
The safety techniques of Chapter 26 (content filtering, refusal, harmlessness) are not just training concerns — they are enforced at SERVING time too. A production system layers input and output safety filters around the model, monitors for abuse patterns, and enforces usage policies. Serving safely at scale means combining the model's trained-in safety with system-level guardrails: filtering, monitoring, rate limits, and the ability to respond quickly to newly-discovered misuse.
This completes the safety picture from Part V: training instills safety into the model, and serving enforces it operationally. Neither alone is sufficient — robust real-world safety comes from defense in depth across both the model and the system that serves it.
31.12

Let us assemble the whole chapter — and much of Part VI — into a picture of a complete production LLM serving stack, from the user's request to the response and back.

Pipeline Flow: A request through the full production stack

1 | Gateway | Authentication, rate limiting, input safety filtering
2 | Router | Cache-aware + model routing to the right replica/model
3 | Load balancer | Spread across healthy replicas in the autoscaled fleet
4 | Serving engine | vLLM: paged KV cache, continuous batching, quantized (Ch. 27)
5 | Generate & stream | Tokens streamed back through output safety filtering
6 | Observe | Log latency, cost, quality signals; alert on anomalies

The Layers of a Serving System

Arch Stack: The production serving stack, layer by layer

API / Gateway | auth, rate limits, safety filtering, versioning
Routing & load balancing | cache-aware, model routing, autoscaling
Serving engines | vLLM/TGI: batching, paging, quantization
GPU fleet | many replicas across regions, with failover
Observability | monitoring, alerting, A/B testing, cost tracking
Scale Note: It All Composes — and It's a Team Sport
The production stack layers everything from Part VI: the per-request optimizations of Chapter 27 (inside the serving engine), the tool and RAG capabilities of Chapters 28–29, the multi-modal handling of Chapter 30, and this chapter's systems engineering around them. Each layer has a job, and together they turn a model into a service that is fast, reliable, safe, and affordable at scale.
And building this is a TEAM effort spanning ML, systems, infrastructure, and operations — a reminder that deploying AI in the real world is a multidisciplinary engineering endeavor, not just a modeling exercise. The model is where it starts; the production stack is what makes it matter to real users.
31.13

Serving-at-Scale Quick-Reference

Concept | Key idea | Remember
Scale = systems | Serving is a systems problem | The model is the easy part
API design | Stream tokens; stateless servers | Streaming = responsive UX
Load balancing | Spread load across replicas | LLM-aware, not round-robin
Autoscaling | Match capacity to demand | Cold starts → scale predictively
SLAs | Promise & measure service | Percentiles, not averages
Routing | Send to the right replica/model | Cache-aware + model routing
Versioning | Safe, reversible rollouts | Always be able to roll back
A/B testing | Measure real-user impact | Online + offline complement
Reliability | Survive failures | Graceful degradation; careful retries
Cost | Lower cost per token | Quantize, batch, route, cache

Exercises

Exercises 1–10 are pen-and-paper or design; 11–20 require code.

Exercise 1: Pen & Paper
Explain why serving at scale is a systems problem, not a model problem. List five things that change between serving one request and serving millions.
Exercise 2: Pen & Paper
Why is streaming essential for interactive LLM products? Connect it to TTFT (Chapter 27) and perceived responsiveness.
Exercise 3: Pen & Paper
Explain why stateless servers scale better than stateful ones. What trade-off do they create, and how does prefix caching address it?
Exercise 4: Pen & Paper
Why does round-robin load balancing work poorly for LLMs? Describe an LLM-aware load-balancing strategy and why it's better.
Exercise 5: Pen & Paper
Explain the cold-start problem in LLM autoscaling and why it forces predictive rather than purely reactive scaling.
Exercise 6: Pen & Paper
Why are SLAs written in percentiles (p95/p99) rather than averages? Construct an example where the average looks fine but the tail is terrible.
Exercise 7: Pen & Paper
Describe cache-aware routing and model routing. How does each improve latency or cost?
Exercise 8: Pen & Paper
Compare canary, blue-green, and shadow rollout strategies. Why must you always be able to roll back?
Exercise 9: Pen & Paper
Explain A/B testing for models. Why is it the gold standard over offline benchmarks, and how do the two complement each other?
Exercise 10: Pen & Paper
Explain how naive retries can cause a retry storm, and how exponential backoff with jitter prevents it.
Exercise 11: Code
Build a streaming LLM API endpoint that sends tokens as server-sent events. Measure the perceived latency improvement vs returning the full response.
Exercise 12: Code
Implement an LLM-aware load balancer that routes to the replica with the lowest current load (queue depth or active requests). Compare its tail latency to round-robin under variable request sizes.
Exercise 13: Code Lab
Simulate autoscaling: model a fluctuating request stream and an autoscaler with a cold-start delay. Compare reactive vs predictive scaling on latency and cost.
Exercise 14: Code
Compute and plot p50, p95, and p99 latency from a stream of request timings. Show how an average can hide a bad tail.
Exercise 15: Code
Implement cache-aware routing: route follow-up requests in a conversation to the replica that holds the cached prefix, and measure the TTFT improvement.
Exercise 16: Code
Implement model routing: a classifier (or heuristic) sends easy queries to a small model and hard ones to a large model. Measure the cost savings and any quality change.
Exercise 17: Code Lab
Simulate an A/B test: assign simulated users to model A or B, generate outcome metrics with a real difference, and run a significance test to decide if B is better.
Exercise 18: Code
Implement retries with exponential backoff and jitter. Simulate a struggling service and show that naive retries cause a storm while backoff allows recovery.
Exercise 19: Code
Build a cost dashboard: track tokens served and cost per token across simulated replicas, broken down by request type. Identify the biggest cost driver.
Exercise 20: Code (Challenge)
Build a mini serving system that ties Part VI together: a streaming API gateway with rate limiting, an LLM-aware load balancer over several simulated replicas (each using continuous batching from Chapter 27), cache-aware and model routing, autoscaling with cold-start modeling, p99-latency and cost-per-token monitoring, and a safe canary rollout of a 'new model'. Drive it with a realistic fluctuating workload and report how latency, throughput, and cost respond — then deliberately fail a replica and show your failover keeps the SLA.

Further reading: “Orca: A Distributed Serving System for Transformer-Based Generative Models” (Yu et al., 2022). The vLLM, TensorRT-LLM, and Ray Serve documentation for production serving. Google's Site Reliability Engineering (SRE) book for SLAs, monitoring, and reliability principles. “The Tail at Scale” (Dean & Barroso, 2013) for tail-latency engineering. Literature on A/B testing and online experimentation (e.g. Kohavi et al.). Cloud providers' guidance on GPU autoscaling and cost optimization for inference workloads.

Part VI Complete: Inference, Tools & Deployment

Ch. 27 | Inference Optimization | KV cache, quantization, PagedAttention, continuous batching, speculative decoding: making a model fast and cheap.
Ch. 28 | Tool Calling & Function Use | JSON-schema tools, structured output, ReAct, reliable agents: letting the model act in the world.
Ch. 29 | Retrieval-Augmented Generation | dense retrieval, vector DBs, chunking, hybrid search, reranking: grounding answers in external knowledge.
Ch. 30 | Multi-modal LLMs | vision encoders, CLIP, LLaVA, audio: teaching the model to see and hear via a shared embedding space.
Ch. 31 | Serving at Scale | API design, load balancing, SLAs, versioning, A/B testing, cost: turning the model into a dependable service.

You have now taken a model all the way from raw mathematics to a deployed, scalable, multi-modal, tool-using service. Across six Parts you have built the foundations (Part I), classical methods (Part II), the Transformer (Part III), pretraining (Part IV), alignment (Part V), and deployment (Part VI). Part VII — Frontier Techniques — turns to the cutting edge and the open horizon: Mixture-of-Experts architectures that scale models efficiently (Chapter 32), long-context and memory methods that extend how much a model can attend to (Chapter 33), agents and multi-agent systems that push tool use to its limits (Chapter 34), and the open problems that remain unsolved at the frontier of the field (Chapter 35). Having mastered how LLMs work and how to deploy them, you are ready to explore where they are going — and to contribute to what comes next.

Attempt each exercise before checking the worked solutions.