Retrieval-Augmented Generation
Detailed solutions for the exercises in Chapter 29. Try solving them yourself before checking the answers.
Solution
RAG addresses: stale knowledge (post-cutoff facts), private/enterprise data the model never trained on, hallucination (reduced by grounding answers in sources), and provenance (citations). Open-book-exam analogy: instead of memorizing everything (closed-book, the model alone), the model looks up relevant passages and answers from them (open book). Retrieval quality dominates because if the right passage isn't retrieved, even a perfect model cannot answer correctly from the context. Garbage in, garbage out: the answer can only be as good as the retrieved context.
Solution
Offline (indexing): documents are chunked, embedded, and stored in a vector index. Online (query time): the query is embedded, the top-k chunks retrieved, inserted into the prompt, and the model generates a grounded answer. When answers are wrong, suspect RETRIEVAL first because the most common failure is that the relevant chunk wasn't retrieved — if the context lacks the answer, the generation step cannot succeed. Check recall before blaming the model.
Solution
Dense retrieval embeds query and documents into vectors capturing MEANING, then retrieves by vector similarity. It can match text sharing no words because semantically similar phrases ('car' vs 'automobile') map to nearby vectors regardless of surface form. Cosine similarity is the cosine of the angle between two embeddings (1 when they point the same way, 0 when they are orthogonal), so it scores how aligned their semantic directions are: conceptual relatedness, not lexical overlap (unlike keyword search).
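A minimal sketch of the similarity computation described above, in plain Python. The three-dimensional "embeddings" are hand-made, hypothetical values rather than the output of a real encoder, and the helper names are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings: the first axis loosely means "vehicle".
docs = {
    "automobile": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.2, 0.9],
    "truck": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedding of "car"
results = top_k(query, docs, k=2)
```

Here "automobile" ranks first even though it shares no characters with "car", because only the direction of the vectors matters.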
Solution
Dense wins on paraphrase/synonymy ('how to fix a flat' → 'repairing a punctured tire'). Sparse (BM25) wins on exact terms, rare identifiers, or codes ('error E1738', a product SKU, a person's exact name) where embeddings may blur specifics. You want BOTH (hybrid) for queries mixing concepts and exact terms — e.g. 'side effects of drug XR-450' needs semantic understanding of 'side effects' AND exact matching of 'XR-450'.
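To make the sparse side concrete, here is a small sketch of Okapi BM25 scoring in plain Python. Whitespace tokenization, the parameter values k1=1.5 and b=0.75, and the example documents are assumptions made for illustration; production systems use a proper tokenizer and an inverted index.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "restart the service when error e1738 appears in the log",
    "general troubleshooting steps for service failures",
    "how to read the log after a crash",
]
scores = bm25_scores("error E1738", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Only the first document contains the exact identifier, so it is the only one with a non-zero score. This is the behaviour that makes sparse retrieval reliable for codes and SKUs.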
Solution
Exact (brute-force) search compares the query to EVERY vector: O(N) work per query, which becomes too slow and too costly as the collection grows to many millions or billions of vectors. Approximate Nearest Neighbor (ANN) methods (HNSW, IVF) build an index that finds the likely-nearest vectors without checking them all, in sublinear time (roughly logarithmic for graph indexes such as HNSW; IVF scans only a few clusters). The trade-off: ANN may miss some true neighbors (recall < 100%) in exchange for huge speedups; tuning the index trades recall against latency, so you accept slightly imperfect retrieval for tractable speed.
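The recall-versus-work trade-off can be illustrated with a toy IVF-style index in plain Python. This is a sketch under simplifying assumptions: centroids are sampled from the data instead of being learned with k-means, the vectors are random, and all function names are hypothetical. It is not how a library such as FAISS implements IVF.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, vectors, k):
    """Brute force: compare the query to every vector, O(N)."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))
    return order[:k]

def build_ivf(vectors, n_lists):
    """Toy IVF: sample centroids, then bucket each vector under its nearest one."""
    step = len(vectors) // n_lists
    centroids = [vectors[i * step] for i in range(n_lists)]
    buckets = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, k, n_probe):
    """Scan only the n_probe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:n_probe]
    candidates = [idx for c in probe for idx in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:k], len(candidates)

def recall(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
query = [rng.random() for _ in range(8)]
centroids, buckets = build_ivf(vectors, n_lists=20)
truth = exact_search(query, vectors, k=10)
fast_ids, scanned_fast = ivf_search(query, vectors, centroids, buckets, k=10, n_probe=3)
full_ids, scanned_full = ivf_search(query, vectors, centroids, buckets, k=10, n_probe=20)
```

Raising `n_probe` scans more vectors and pushes recall toward 1.0; probing every bucket degenerates to exact search. Comparing `recall(fast_ids, truth)` with `scanned_fast` shows the knob that trades recall against latency.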
Solution
Too-small chunks lose context (a sentence without its surroundings is ambiguous); too-large chunks dilute relevance (the embedding averages many topics, and irrelevant text crowds the context). Overlap helps because an answer spanning a chunk boundary is then likely to appear whole in at least one chunk, provided the overlap is at least as long as the spanning passage. Natural-boundary chunking (by paragraph/section) beats fixed-size because it keeps semantically coherent units intact, producing cleaner embeddings and more self-contained retrieved passages.
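A short sketch of the chunking strategies discussed above. Character-based windows and blank-line paragraph splitting are simplifying assumptions (real pipelines usually count tokens), and the function names are illustrative.

```python
def chunk_fixed(text, size, overlap=0):
    """Split text into windows of `size` characters.

    Each window shares `overlap` characters with the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

def chunk_by_paragraph(text):
    """Split on blank lines so each chunk is one coherent paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

text = "The capital of France is Paris. It lies on the Seine."
no_overlap = chunk_fixed(text, size=28)
with_overlap = chunk_fixed(text, size=28, overlap=10)
```

Without overlap the word "Paris" is cut in two at the boundary, so no chunk contains it; with a 10-character overlap the second window contains the word whole.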
Solution
Hybrid search runs both dense and sparse retrieval and combines their results. Reciprocal Rank Fusion (RRF) scores each document by Σ 1/(k + rank) across the retrievers and re-sorts, where rank is the document's position in each retriever's list and k is a smoothing constant (60 is a common choice) that damps the influence of the very top ranks. RRF uses RANKS rather than raw scores because dense (cosine) and sparse (BM25) scores live on incomparable scales, so fusing raw scores would let one method's larger numeric range dominate. Ranks are comparable across methods, so RRF combines them fairly without score normalization.
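The fusion formula can be sketched in a few lines. The default k=60 is a commonly used value and is an assumption here, as are the document ids and the two example rankings.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids.

    Each list contributes 1 / (k + rank) to a document's score,
    with rank starting at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_ranking = ["d2", "d1", "d3"]   # hypothetical dense (embedding) results
sparse_ranking = ["d3", "d2", "d4"]  # hypothetical sparse (BM25) results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that both retrievers rank well ("d2", "d3") rise above documents found by only one, and no raw score ever needs to be normalized.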
Solution
Reranking re-scores the retrieved candidates with a more powerful model to reorder them. A bi-encoder embeds query and document SEPARATELY (fast, enables indexing) but never lets them interact. A cross-encoder feeds query and document TOGETHER through a Transformer, so attention models their fine-grained interaction — far more accurate, but too slow to run over the whole corpus. So it is applied only to the top candidates from first-stage retrieval: cheap retrieval narrows to dozens, then the expensive cross-encoder picks the best few.
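The two-stage design can be sketched with pluggable scoring functions. The two scorers below are toy stand-ins (word overlap for the bi-encoder stage, an exact-phrase bonus for the cross-encoder stage), not real models, and every name in the block is hypothetical.

```python
def retrieve_then_rerank(query, corpus, first_stage_score, rerank_score,
                         n_candidates=50, n_final=5):
    """Cheap scorer over the whole corpus, expensive scorer over the shortlist only."""
    shortlist = sorted(corpus, key=lambda doc: first_stage_score(query, doc),
                       reverse=True)[:n_candidates]
    reranked = sorted(shortlist, key=lambda doc: rerank_score(query, doc),
                      reverse=True)
    return reranked[:n_final]

def overlap_score(query, doc):
    """Toy first-stage scorer: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

rerank_calls = []  # records which documents the expensive scorer saw

def phrase_score(query, doc):
    """Toy reranker: reads query and document together and rewards an exact phrase."""
    rerank_calls.append(doc)
    bonus = 2 if query.lower() in doc.lower() else 0
    return overlap_score(query, doc) + bonus

corpus = [
    "a tire fix kit for a flat",
    "to fix a flat tire first loosen the lug nuts",
    "flat design is a style of interface design",
    "the history of the bicycle",
]
top = retrieve_then_rerank("fix a flat tire", corpus, overlap_score, phrase_score,
                           n_candidates=3, n_final=1)
```

The first two documents tie on word overlap, so the first stage alone would return the kit advert; the reranker, which sees query and document jointly, promotes the passage that actually answers the question. The expensive scorer runs on three documents, never on the whole corpus.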
Solution
Lost-in-the-middle: models attend best to the start and end of the context and worst to the middle, so a relevant chunk buried in the middle may be ignored. Mitigations: (1) reorder retrieved chunks to place the most relevant at the start/end; (2) retrieve fewer, higher-quality chunks (rerank and trim). More context is not always better because adding marginally-relevant chunks pushes key information into the neglected middle and distracts the model — a focused context often beats a larger one.
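Mitigation (1) can be sketched as a simple reordering step. This is one possible scheme, assuming the input is already sorted most-relevant first; the function name is illustrative.

```python
def reorder_for_context(chunks_by_relevance):
    """Place the most relevant chunks at the edges, the least relevant in the middle.

    Chunks are dealt alternately to the front and the back of the prompt,
    so relevance falls off toward the centre of the context.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]  # r1 is the most relevant
ordered = reorder_for_context(ranked)
# ordered == ["r1", "r3", "r5", "r4", "r2"]
```

The two best chunks end up first and last, and the weakest chunk sits in the middle where a missed passage costs the least.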
Solution
RAG: best for large, changing, or private knowledge needing citations — update the index, not the model. Fine-tuning: best for teaching STYLE, FORMAT, or skills (behavior), not volatile facts. Long context: best when the relevant info is bounded and fits the window, and you want simplicity. They combine well: fine-tune the model to use retrieved context effectively and adopt the right style, while RAG supplies the up-to-date facts — fine-tuning for HOW, RAG for WHAT.
Solution
Embedding the chunks and query with a sentence encoder, computing cosine similarities, and returning the top-k implements the core retrieval step (Exercise 3) — the minimal RAG retriever, matching meaning rather than keywords.
Solution
A Flat index gives exact search but scales linearly; HNSW gives near-exact results far faster. Comparing them shows HNSW achieving large speedups at slightly reduced recall — the ANN trade-off of Exercise 5 made concrete.
Solution
Comparing fixed-size, fixed-with-overlap, and sentence-based chunking on a document set typically shows overlap and natural-boundary strategies retrieving more relevant, self-contained chunks (Exercise 6) — demonstrating that chunking choices materially affect retrieval quality.
Solution
BM25 captures exact-term matches; fusing it with dense retrieval via RRF (Exercise 7) recovers queries that pure dense misses (rare identifiers) and pure sparse misses (paraphrases). A query with both an exact code and a paraphrased concept shows hybrid beating either alone (Exercise 4).
Solution
Retrieving 50 candidates cheaply then reranking with a cross-encoder to the top 5 (Exercise 8) yields more relevant final passages than retrieval's raw top 5, because the cross-encoder models query-document interaction — the accuracy gain that justifies the two-stage retrieve-then-rerank design.
Solution
Instructing the model to answer only from the retrieved context, cite sources, and say 'I don't know' when the context lacks the answer prevents it from fabricating when retrieval fails — demonstrating that grounding instructions, plus an abstention clause, curb hallucination (Exercise 1).
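One way to phrase such a prompt is sketched below. The exact wording, the numbered-source format, and the example content are assumptions made for illustration; they are not a prescribed template.

```python
def build_grounded_prompt(question, chunks):
    """Assemble a prompt that restricts the model to the numbered sources."""
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Answer the question using ONLY the sources below.\n"
        "Cite the sources you use by number, e.g. [1].\n"
        "If the sources do not contain the answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up example content.
prompt = build_grounded_prompt(
    "How long is the warranty?",
    ["The Acme warranty lasts 24 months.", "Returns are accepted within 30 days."],
)
```

Numbering the sources gives the model something concrete to cite, and the explicit abstention sentence gives it a permitted way out when retrieval has failed.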
Solution
Placing the answer-bearing chunk at different positions and measuring whether the model uses it reproduces the lost-in-the-middle effect (Exercise 9): accuracy is highest when the chunk is at the start or end and drops in the middle — motivating relevance-aware context ordering.
Solution
Assembling the full offline+online pipeline (Exercise 2) and answering questions while inspecting the retrieved chunks shows the system grounding its answers in real passages — and makes debugging easy, since you can see whether a wrong answer came from bad retrieval or bad generation.
Solution
Measuring recall@k (did the right chunk get retrieved?) and faithfulness (does the answer only claim what sources support?) SEPARATELY localizes failures — low recall is a retrieval problem, low faithfulness with good recall is a generation problem (Exercise 2). Separate metrics are essential for diagnosing and improving RAG.
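The retrieval half of this evaluation is easy to compute directly. The sketch below assumes each test question comes with a hand-labelled set of relevant chunk ids; the ids and runs are made up. Faithfulness is not shown because it needs a human or model judge.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(set(relevant_ids))

def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(retrieved, relevant, k)
               for retrieved, relevant in runs) / len(runs)

runs = [
    (["c4", "c1", "c9"], ["c1"]),        # gold chunk retrieved at rank 2
    (["c2", "c7", "c5"], ["c3"]),        # gold chunk missed entirely
    (["c8", "c6", "c3"], ["c8", "c3"]),  # two gold chunks, both found by rank 3
]
recall_3 = mean_recall_at_k(runs, k=3)
recall_1 = mean_recall_at_k(runs, k=1)
```

The second run is the kind of case to inspect first: the gold chunk never reached the prompt, so no change to the generator could have fixed that answer.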
Solution
Using the model to expand or clarify a terse query before retrieval (adding context/synonyms) improves recall on under-specified questions — the retriever gets a richer query to match against, surfacing relevant chunks the original terse query missed.
Solution
Treating retrieval as a tool (Chapter 28) lets the model decide when and what to search, issuing multiple retrievals for a multi-hop question (retrieve fact A, then use it to retrieve fact B). This adaptive retrieval handles questions a single fixed retrieval cannot — the bridge from RAG to agents.
Solution
Building the full pipeline and then DEGRADING each stage in turn (bad chunking, no reranking, no hybrid) while measuring recall and faithfulness quantifies each component's contribution — typically reranking and hybrid retrieval give large gains, and good chunking is foundational. The ablation shows RAG quality is the product of many stages, each worth getting right (Exercise 2's 'suspect retrieval first' at system scale).