Part VI: Productionization

Chapter 29

Retrieval-Augmented Generation

Grounding a model in external knowledge: dense retrieval and embeddings, vector databases and FAISS, chunking strategies, hybrid search, reranking, and getting the right context into the prompt.

22 Exercises

Learning Objectives

1.	Explain why RAG exists and what problems it solves.
2.	Understand the retrieve-then-generate pipeline end to end.
3.	Use embeddings and dense retrieval to find relevant text by meaning.
4.	Understand vector databases and approximate nearest-neighbor search (FAISS).
5.	Choose good chunking strategies for splitting documents.
6.	Combine dense and keyword search with hybrid retrieval.
7.	Improve results with reranking.
8.	Assemble retrieved context into an effective prompt.
9.	Diagnose and fix common RAG failure modes.
10.	Evaluate a RAG system's retrieval and generation quality.

In Chapter 28 the model gained the ability to call tools. One of the most valuable things to fetch is KNOWLEDGE — facts, documents, and data the model was never trained on or cannot reliably recall. Retrieval-Augmented Generation (RAG) is the technique for grounding a model's answers in an external knowledge source: retrieve the relevant text, put it in the prompt, and let the model answer based on it. RAG is one of the most widely-used LLM techniques in production, and this chapter builds it from the ground up.

The Problems RAG Solves

Problem	Without RAG	With RAG
Stale knowledge	Frozen at training cutoff	Retrieve current documents
No private data	Never saw your files	Retrieve from your knowledge base
Hallucination	Invents plausible facts	Grounds answers in real text
No sources	Can't cite anything	Cites the retrieved documents
Limited memory	Can't hold all knowledge	Stores knowledge externally

The Core Idea: Look It Up, Then Answer

RAG mirrors how a careful person answers a hard question: rather than answering from memory alone, they LOOK IT UP in a reliable source, then answer based on what they found. RAG gives a model the same workflow. Given a question, the system first RETRIEVES the most relevant pieces of text from a knowledge base, inserts them into the prompt, and asks the model to answer USING that retrieved context. The model's broad language ability is combined with specific, up-to-date, trustworthy information.

Retrieval-Augmented Generation (RAG)

A technique that improves a model's answers by retrieving relevant text from an external knowledge source and providing it in the prompt, so the model generates answers grounded in that retrieved information rather than from its weights alone.

✧

RAG Note: RAG = Open-Book Exam

The cleanest analogy: a model without RAG is taking a CLOSED-BOOK exam — it must answer from memory, and it may misremember or make things up. RAG turns it into an OPEN-BOOK exam — the model can consult the relevant pages before answering. Just as open-book exams produce more accurate, citable answers, RAG produces answers grounded in real, current, source-able text.

And like a real open-book exam, the quality depends on finding the RIGHT pages quickly. A student who can't locate the relevant passage does poorly even with the book open. Most of this chapter is about the retrieval half — finding the right text — because that is what makes or breaks a RAG system.

RAG has two phases. An OFFLINE phase prepares the knowledge base (done once, ahead of time), and an ONLINE phase answers each query (done per question). Understanding the two phases and how they fit together is the foundation for everything else.

Offline: Building the Index

Pipeline Flow: Offline phase: prepare the knowledge base (done once)

1	Collect	Gather your documents (files, web pages, database records)
2	Chunk	Split documents into smaller passages (Section 29.5)
3	Embed	Convert each chunk into a vector capturing its meaning (Section 29.3)
4	Index	Store the vectors in a vector database for fast search (Section 29.4)

Online: Answering a Query

When a question arrives, the system retrieves relevant chunks and generates a grounded answer. This is the retrieve-then-generate flow — the heart of RAG:

Tool Trace: Online phase: the retrieve-then-generate flow

User	What is our company's parental leave policy?	→
Embed	Convert the question into a query vector	•
Retriever	Find the most similar chunks in the vector DB	→
Vector DB	Returns top-k relevant passages from the HR handbook	←
App	Stuffs the retrieved passages into the prompt as context	•
Model	Generates an answer grounded in the retrieved policy text	•
User	'Employees get 16 weeks of paid leave... [grounded in the handbook]'	←

text•The RAG pipeline (Pseudocode)
# OFFLINE (once): build the index
for each document: chunk it, embed each chunk, store vectors in the index

# ONLINE (per query): retrieve then generate
1. embed the user's query into a vector
2. retrieve the top-k most similar chunks from the index
3. (optional) rerank the chunks for relevance
4. build a prompt: question + retrieved chunks as context
5. the model generates an answer grounded in that context

✧

RAG Note: Two Halves: Retrieval and Generation

RAG has a RETRIEVAL half (find the right text) and a GENERATION half (write a good answer from it). Beginners often focus on the generation — the model — but in practice, RETRIEVAL quality dominates RAG success. If retrieval surfaces the wrong passages, even the best model produces a wrong or ungrounded answer ('garbage in, garbage out'). Most of the engineering effort, and most of this chapter, is on retrieval.

A useful rule of thumb: if your RAG system gives bad answers, suspect retrieval first. Check whether the right passages were actually retrieved before blaming the model. More often than not, the model answered correctly given what it was handed — the problem was that it was handed the wrong context.

The heart of modern retrieval is the EMBEDDING — a vector that captures the MEANING of a piece of text, building directly on the embeddings of Chapter 8. Dense retrieval uses embeddings to find text by SEMANTIC similarity: passages that MEAN the same thing as the query, even if they share no words. This is what lets RAG find 'parental leave' content when you ask about 'maternity time off'.

From Text to Vectors

An embedding model converts any piece of text into a fixed-length vector — a list of numbers — positioned so that texts with SIMILAR MEANING have NEARBY vectors. 'How do I reset my password?' and 'I forgot my login credentials' land close together in the vector space, despite sharing almost no words, because they mean nearly the same thing. This semantic matching is the superpower of dense retrieval over keyword search.

Shape Trace: Embedding text for retrieval

Operation	Shape	Note
query text	"reset password"	raw string
embedding model	→	encode
query vector	(768,)	captures meaning
compare to chunk vectors	(N, 768)	cosine similarity
top-k nearest	(k, 768)	most similar chunks

Measuring Similarity

To find the chunks most similar to the query, we compare their vectors. The standard measure is COSINE SIMILARITY — the cosine of the angle between two vectors (Chapter 1) — which is high when vectors point in the same direction (similar meaning) and low when they don't. Retrieval finds the k chunks whose vectors are most similar to the query vector.

text•Cosine similarity for retrieval
similarity(q, c) = (q · c) / (||q|| · ||c||)

q = query vector,  c = chunk vector
=  1.0  when vectors point the same way (near-identical meaning)
=  0.0  when unrelated (vectors orthogonal)
= -1.0  when vectors point in opposite directions (the lower bound)

# Retrieve the k chunks with the HIGHEST similarity to the query.
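
To make the formula concrete, here is a minimal pure-Python sketch that evaluates it on three toy vectors. The vectors are made-up 3-dimensional numbers chosen for illustration, not output from a real embedding model, and the function name is an assumption of this sketch.

```python
import math

def cosine_similarity(q, c):
    """Cosine of the angle between vectors q and c."""
    dot = sum(qi * ci for qi, ci in zip(q, c))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return dot / (norm_q * norm_c)

# Toy 3-dimensional "embeddings" (made-up numbers, not from a real model).
query = [1.0, 2.0, 0.0]
aligned = [2.0, 4.0, 0.0]      # same direction as the query, different length
orthogonal = [0.0, 0.0, 5.0]   # shares no direction with the query
opposite = [-1.0, -2.0, 0.0]   # points the other way

for name, vec in [('aligned', aligned), ('orthogonal', orthogonal), ('opposite', opposite)]:
    print(name, round(cosine_similarity(query, vec), 3))
```

The aligned vector scores about 1 even though it is twice as long: cosine similarity ignores length and compares direction only. That is why normalizing vectors up front lets a plain dot product stand in for it.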

Python•Dense retrieval from scratch
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')

# OFFLINE: embed all chunks once
chunks = ['Reset your password in Settings.', 'Office hours are 9-5.']  # ... plus the rest of your chunks
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # (N, 384) for this model

def retrieve(query, k=3):
    """Return the k chunks most similar to the query, as (chunk, score) pairs."""
    q = embedder.encode(query, normalize_embeddings=True)  # (384,)
    # Cosine similarity = dot product (vectors are normalized)
    scores = chunk_vecs @ q                       # (N,)
    top = np.argsort(-scores)[:k]                 # k highest
    return [(chunks[i], scores[i]) for i in top]

retrieve('I forgot my login')  # -> finds the password-reset chunk,
                          #    despite sharing no words. Semantic match!

✧

RAG Note: Dense vs Sparse Retrieval

Traditional 'sparse' retrieval (like BM25) matches on KEYWORDS: it finds documents containing the query's words. 'Dense' retrieval matches on MEANING via embeddings. Dense retrieval handles synonyms and paraphrases that keyword search misses ('car' vs 'automobile'), but keyword search is hard to beat for exact terms, names, codes, and rare words that embeddings may blur. Each has strengths, which is why hybrid search (Section 29.6) combines them.

The embedding model matters enormously: a better embedding model places semantically-related texts closer together, directly improving retrieval. Choosing and sometimes fine-tuning the embedding model for your domain is one of the highest-leverage decisions in a RAG system.
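
For contrast with dense retrieval, the sketch below implements a small BM25 scorer in pure Python. It is an illustration rather than a production implementation: the tokenizer is deliberately crude, the values `k1=1.5` and `b=0.75` are common defaults assumed here, the IDF uses the smoothed form that stays non-negative, and the `BM25` class and its method names are inventions of this sketch.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip('.,!?;:()').lower() for w in text.split())
    return [w for w in words if w]

class BM25:
    """Toy BM25 keyword scorer (illustrative, not optimized)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        self.n_docs = len(docs)
        self.avg_len = sum(len(t) for t in self.doc_tokens) / self.n_docs
        # Document frequency: how many documents contain each term
        self.df = Counter(term for t in self.doc_tokens for term in set(t))

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log((self.n_docs - n + 0.5) / (n + 0.5) + 1)

    def score(self, query, doc_index):
        freqs = self.doc_freqs[doc_index]
        # Longer-than-average documents are penalized via length_norm
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[doc_index]) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def search(self, query, k=3):
        """Return indices of the top-k documents with a non-zero score."""
        scored = sorted(((self.score(query, i), i) for i in range(self.n_docs)),
                        key=lambda pair: -pair[0])
        return [i for s, i in scored[:k] if s > 0]

docs = [
    'Reset your password in Settings.',
    'Office hours are 9-5.',
    'Error code TX-409 means the billing service timed out.',
]
bm25 = BM25(docs)
print(bm25.search('what does TX-409 mean'))   # the exact code matches document 2
print(bm25.search('I forgot my login'))       # no shared words, so nothing is returned
```

The two queries show the trade-off described above. The exact code is found immediately, while the paraphrased password question shares no token with the password-reset document and returns nothing.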

Computing cosine similarity against every chunk works for a few thousand chunks, but real knowledge bases have millions or billions. Comparing the query to every single vector (a 'brute-force' search) becomes too slow. VECTOR DATABASES solve this with specialized indexes that find the nearest vectors quickly — the infrastructure that makes RAG scale.

The Problem: Exact Search Is Too Slow

Brute-force search compares the query against all N chunk vectors — O(N) per query. For millions of vectors and many queries per second, this is prohibitively slow. We need a way to find the nearest vectors WITHOUT comparing against all of them. The trick is to accept APPROXIMATE answers: find vectors that are ALMOST certainly among the nearest, far faster than guaranteeing the exact nearest.

Approximate Nearest Neighbor (ANN)

A search that finds vectors very close to the query without exhaustively comparing against all of them, trading a small chance of missing the true nearest neighbor for dramatically faster search.

How ANN Indexes Work

ANN indexes organize vectors so that search can skip most of them. Two popular families: (1) HNSW (Hierarchical Navigable Small World) builds a navigable graph where search hops toward the query through a few well-connected nodes; (2) IVF (Inverted File) clusters vectors and searches only the nearest clusters. Both turn an O(N) scan into something far faster, at the cost of occasionally missing a true nearest neighbor.

Index type	How it works	Trade-off
Flat (brute force)	Compare against every vector	Exact, but O(N) slow
IVF	Cluster, search nearest clusters only	Fast, may miss some
HNSW	Navigable graph, hop toward query	Fast, high recall, more memory
PQ (quantization)	Compress vectors to save memory	Smaller, slight accuracy loss
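
The IVF idea from the table can be sketched in a few dozen lines of pure Python. This is a toy under stated assumptions: random 2-dimensional vectors, squared Euclidean distance (which ranks normalized vectors the same way as cosine similarity), and a simple k-means with a fixed seed. The names `ToyIVF`, `kmeans`, and `n_probe` are illustrative and are not FAISS's API.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, n_clusters, n_iters=10, seed=0):
    """Very small k-means: returns a list of centroid vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)
    for _ in range(n_iters):
        groups = [[] for _ in range(n_clusters)]
        for v in vectors:
            nearest = min(range(n_clusters), key=lambda c: sq_dist(v, centroids[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

class ToyIVF:
    """Cluster the vectors once; at query time scan only the nearest clusters."""

    def __init__(self, vectors, n_clusters=4):
        self.vectors = vectors
        self.centroids = kmeans(vectors, n_clusters)
        self.lists = [[] for _ in range(n_clusters)]   # vector ids per cluster
        for i, v in enumerate(vectors):
            c = min(range(n_clusters), key=lambda j: sq_dist(v, self.centroids[j]))
            self.lists[c].append(i)

    def search(self, query, k=3, n_probe=1):
        # Rank clusters by centroid distance, then scan only the n_probe nearest
        order = sorted(range(len(self.centroids)),
                       key=lambda j: sq_dist(query, self.centroids[j]))
        candidates = [i for j in order[:n_probe] for i in self.lists[j]]
        candidates.sort(key=lambda i: sq_dist(query, self.vectors[i]))
        return candidates[:k]

rng = random.Random(1)
vectors = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
ivf = ToyIVF(vectors, n_clusters=4)
query = [0.3, -0.2]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:5]
approx = ivf.search(query, k=5, n_probe=1)
print(len(set(approx) & set(exact)), 'of 5 exact neighbours found with n_probe=1')
```

Raising `n_probe` scans more clusters: recall rises and speed falls, and probing every cluster reproduces the exact brute-force result. That knob is the approximate-search trade-off in miniature.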

FAISS: A Vector Search Library

FAISS (Facebook AI Similarity Search) is a widely used library implementing these indexes. It is a LIBRARY (you embed it in your application), as opposed to vector DATABASES and database extensions (Pinecone, Weaviate, Qdrant, Milvus, and pgvector for PostgreSQL) that add storage, metadata filtering, and operations. For learning and many applications, FAISS is the workhorse; for production with persistence and scaling needs, a dedicated vector database often fits better.

Python•Vector search with FAISS
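import faiss
import numpy as np

# OFFLINE: build the index from chunk vectors, shape (N, dim).
# FAISS expects float32 arrays.
chunk_vecs = np.asarray(chunk_vecs, dtype='float32')
dim = chunk_vecs.shape[1]                # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatIP(dim)           # inner product = cosine (normalized)
index.add(chunk_vecs)                    # add all chunk vectors

# For millions of vectors, use an ANN index instead of Flat:
# index = faiss.IndexHNSWFlat(dim, 32)    # HNSW graph index (L2 metric by default;
#                                         # same ranking as cosine for normalized vectors)

# ONLINE: search for the k nearest chunks to a query vector
query_vec = embedder.encode('I forgot my login', normalize_embeddings=True)
scores, ids = index.search(query_vec[None].astype('float32'), 5)
results = [chunks[i] for i in ids[0] if i != -1]   # -1 marks an unfilled slot

# Flat is exact but O(N); HNSW/IVF are approximate but scale to millions.
# Vector DBs (Pinecone, Qdrant, pgvector) add persistence + filtering.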

✧

Scale Note: Metadata Filtering Matters

Real RAG systems rarely search ALL vectors — they filter by METADATA first. You might restrict the search to a specific user's documents, a date range, a department, or a document type, THEN do vector search within that subset. This 'filtered vector search' is essential for correctness (don't retrieve another user's data) and relevance (search only the right corpus). It is a key feature distinguishing production vector databases from a bare FAISS index.

Combining metadata filters with vector similarity is also a performance question: filtering first shrinks the search space, but must be done carefully so the ANN index still works efficiently. Production vector databases are largely about doing this well at scale.
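
A minimal sketch of the filter-then-search idea, using brute-force scoring over a handful of records. The record layout (`text`, `vec`, `meta`), the 2-dimensional vectors, and the function name are all assumptions made for illustration; a production vector database would apply the filter inside its index rather than in a Python list comprehension.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_search(query_vec, records, k=3, **required):
    """Brute-force search restricted to records whose metadata matches."""
    # Step 1: keep only records that satisfy every metadata constraint
    allowed = [
        r for r in records
        if all(r['meta'].get(field) == value for field, value in required.items())
    ]
    # Step 2: rank the survivors by similarity to the query
    allowed.sort(key=lambda r: -dot(query_vec, r['vec']))
    return [r['text'] for r in allowed[:k]]

# Toy records with made-up 2-d vectors.
records = [
    {'text': 'Alice: Q3 budget draft', 'vec': [0.9, 0.1],
     'meta': {'owner': 'alice', 'dept': 'finance'}},
    {'text': 'Bob: Q3 budget notes', 'vec': [1.0, 0.0],
     'meta': {'owner': 'bob', 'dept': 'finance'}},
    {'text': 'Alice: holiday schedule', 'vec': [0.1, 0.9],
     'meta': {'owner': 'alice', 'dept': 'hr'}},
]
query_vec = [1.0, 0.0]

print(filtered_search(query_vec, records, k=2, owner='alice'))
```

Bob's note is the closest vector overall, yet it never appears in Alice's results because the filter runs before any similarity is computed. That ordering is what makes the filter a correctness guarantee, not merely a relevance hint.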

Before documents can be embedded and indexed, they must be split into CHUNKS — smaller passages. Chunking is deceptively important: get it wrong and even great embeddings and search produce poor results, because the units being retrieved are the wrong size or split mid-thought. Beginners often overlook chunking; experts know it is one of the biggest levers on RAG quality.

Why Chunk At All?

Two reasons. First, embeddings work best on focused passages — embedding an entire 50-page document into one vector blurs its many topics into mush, so retrieval can't distinguish them. Second, the model's context window is finite — you can only fit so much retrieved text in the prompt, so you want to retrieve focused, relevant passages, not whole documents. Chunking creates retrievable units that are focused enough to embed well and small enough to fit in context.

The Chunking Trade-off

Chunks too small	Chunks too large
Lose surrounding context	Dilute the relevant part
Answer split across chunks	Embedding blurs many topics
Retrieve fragments	Waste context-window space
Miss the full picture	Retrieve irrelevant text too
e.g. single sentences	e.g. whole documents

The art is finding a chunk size that is focused enough to embed and retrieve precisely, yet large enough to contain a complete idea. There is no universal answer — it depends on your documents and queries — but a few hundred tokens with some overlap is a common starting point.

Chunking Strategies

Strategy	How it splits
Fixed-size	Every N tokens/characters — simple but may split mid-sentence
With overlap	Fixed-size, but chunks overlap so context isn't lost at boundaries
Sentence/paragraph	Split on natural boundaries — keeps ideas intact
Recursive	Try paragraphs, then sentences, then words to hit a target size
Semantic	Split where the topic shifts (detected by embedding changes)
Structure-aware	Respect document structure (headings, sections, code blocks)

Python•Chunking with overlap
def chunk_with_overlap(text, size=400, overlap=50):
    """Split text into overlapping chunks, measured in words."""
    if not 0 <= overlap < size:
        raise ValueError('overlap must be smaller than size')
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break    # this chunk reached the end; skip a redundant tail chunk
        # Step forward by (size - overlap) so chunks OVERLAP
        start += size - overlap
    return chunks

# Overlap means an idea that spans a chunk boundary (and is shorter than
# the overlap) appears WHOLE in at least one chunk, so it can be retrieved
# intact. Better still: split on paragraph/sentence boundaries to avoid
# cutting mid-thought.

✧

RAG Note: Overlap Prevents Boundary Loss

A key trick: make chunks OVERLAP slightly. Without overlap, an idea that straddles a chunk boundary gets split — half in one chunk, half in the next — and neither chunk contains the complete thought, so retrieval may miss it. With overlap, the boundary region appears in full in at least one chunk. A modest overlap (10–20% of chunk size) is cheap insurance against boundary loss.

Even better than fixed-size-with-overlap is splitting on NATURAL boundaries (paragraphs, sections), so chunks contain complete ideas by construction. Structure-aware chunking that respects headings and sections often outperforms naive fixed-size splitting, especially for well-structured documents.
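
One way to split on natural boundaries is to pack whole paragraphs into chunks up to a word budget. The sketch below assumes paragraphs are separated by blank lines and measures size in words rather than model tokens; the function name and the `max_words` parameter are illustrative choices.

```python
def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own
    oversized chunk rather than being cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append('\n\n'.join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append('\n\n'.join(current))
    return chunks

doc = 'a b c\n\nd e\n\nf g h i'
print(chunk_by_paragraph(doc, max_words=5))
```

With a budget of five words, the first two paragraphs share a chunk and the third starts a new one, so no paragraph is ever split. A fuller version might fall back to sentence splitting for oversized paragraphs, which is the recursive strategy listed in the table.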

Dense (semantic) retrieval and sparse (keyword) retrieval each have strengths and weaknesses. HYBRID SEARCH combines them to get the best of both — and in practice, hybrid search usually beats either alone. Understanding why requires seeing exactly where each method shines and fails.

Where Each Method Wins and Fails

Query type	Dense wins	Keyword wins
Synonyms / paraphrases	✓ strong	✗ misses
Exact terms / codes / IDs	✗ may blur	✓ exact
Names, acronyms, rare words	✗ weak	✓ strong
Conceptual / fuzzy questions	✓ strong	✗ weak
Typos / morphology	varies	✗ brittle

The pattern is clear: dense retrieval excels at MEANING (synonyms, concepts, fuzzy questions), while keyword search excels at EXACTNESS (specific terms, names, codes, rare words). A query like 'error code TX-409 in the billing module' needs BOTH — the exact code (keyword) and the conceptual context of billing errors (dense). Hybrid search runs both and combines the results.

Combining the Scores

A standard way to merge dense and keyword results is RECIPROCAL RANK FUSION (RRF): each method ranks the chunks, and a chunk's combined score is based on its RANK in each list (not its raw scores, which aren't comparable across methods). Chunks that rank highly in EITHER method bubble to the top, and chunks that rank highly in both score highest. This is simple, robust, and avoids the problem of dense and keyword scores being on different scales.

text•Reciprocal Rank Fusion (RRF)
score(chunk) = Σ   1 / (k + rank_in_list)
             lists

k ≈ 60 (a smoothing constant)
rank_in_list = the chunk's position in each method's ranking (1 = best)

# Chunks ranked highly by EITHER dense or keyword search score well.
# Uses RANKS, not raw scores -- so the two methods' scales don't matter.

Python•Hybrid search with reciprocal rank fusion
def reciprocal_rank_fusion(dense_ids, keyword_ids, k=60):
    """Fuse two ranked lists into one combined ranking."""
    scores = {}
    for ranked_list in [dense_ids, keyword_ids]:
        for rank, doc_id in enumerate(ranked_list, start=1):
            # Ranks start at 1; an earlier rank gives a bigger contribution
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by combined score, highest first
    return sorted(scores, key=scores.get, reverse=True)

dense   = dense_retrieve(query, k=20)     # semantic results
keyword = bm25_retrieve(query, k=20)      # keyword (BM25) results
final   = reciprocal_rank_fusion(dense, keyword)[:10]

# Best of both: semantic understanding + exact-term matching.
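
A small worked example makes the arithmetic visible. The chunk ids `A` to `D` and both rankings are invented for illustration, ranks are counted from 1, and `rrf_scores` is a name introduced for this sketch.

```python
def rrf_scores(ranked_lists, k=60):
    """Return {doc_id: fused score} using reciprocal rank fusion (ranks from 1)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return scores

dense_ranking = ['A', 'B', 'C']     # toy semantic ranking
keyword_ranking = ['C', 'A', 'D']   # toy BM25 ranking

scores = rrf_scores([dense_ranking, keyword_ranking])
fused = sorted(scores, key=scores.get, reverse=True)

# A: 1/61 + 1/62   (1st in dense, 2nd in keyword)
# C: 1/63 + 1/61   (3rd in dense, 1st in keyword)
# B: 1/62          (dense only)
# D: 1/63          (keyword only)
for doc_id in fused:
    print(doc_id, round(scores[doc_id], 5))
```

Chunks A and C appear in both lists and so outrank B and D, which each appear in only one. A edges out C because its two ranks (1st and 2nd) are better than C's (3rd and 1st).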

✧

RAG Note: Hybrid Is the Production Default

In production RAG, hybrid search is usually the right default. The cost of running both a dense and a keyword search is modest, and the robustness gain is large — you stop failing on the queries where one method alone is weak (exact codes for dense, paraphrases for keyword). Most mature vector databases support hybrid search natively, fusing dense and sparse results for you.

The lesson generalizes: when two methods have complementary strengths and combining them is cheap, combine them. Dense and keyword retrieval are not rivals — they are partners that cover each other's blind spots.

Initial retrieval (dense, keyword, or hybrid) is fast but coarse — it casts a wide net to find candidate chunks. RERANKING adds a second, more precise pass: take the top candidates from retrieval and re-score them with a more powerful (but slower) model that judges relevance more accurately. This two-stage 'retrieve-then-rerank' approach significantly improves the quality of the final context.

Why a Second Pass Helps

Fast retrieval embeds the query and each chunk SEPARATELY, then compares vectors — the chunk's vector was computed without ever seeing the query. A RERANKER instead looks at the query and a candidate chunk TOGETHER, in one model pass, judging how well that specific chunk answers that specific query. This 'cross-encoder' approach is far more accurate, but too slow to run over millions of chunks — so it is applied only to the top candidates from fast retrieval.

Arch Stack: Two-stage retrieval: retrieve wide, then rerank precisely

Final top-k (e.g. 5)	fed to the model as context
Reranker (cross-encoder)	scores query+chunk together, accurately
Candidates (e.g. top 50)	from fast retrieval
Fast retrieval (bi-encoder / ANN)	over millions of chunks

Bi-encoder (retrieval)	Cross-encoder (reranking)
Embeds query & chunk separately	Processes query + chunk together
Vectors precomputed offline	Computed per query-chunk pair
Fast — scales to millions	Slow — only for top candidates
Coarse relevance	Precise relevance
Stage 1: cast a wide net	Stage 2: pick the best few

Python•Reranking retrieved candidates
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, k_retrieve=50, k_final=5):
    # Stage 1: fast retrieval casts a wide net
    # retrieve() returns (chunk, score) pairs; keep only the chunk text
    candidates = [chunk for chunk, _ in retrieve(query, k=k_retrieve)]

    # Stage 2: rerank by scoring (query, chunk) pairs together
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)            # precise relevance scores

    # Keep the k_final best after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, s in ranked[:k_final]]

# Retrieve 50 cheaply, rerank to the best 5 precisely. The 5 that
# go in the prompt are usually more relevant than raw retrieval's top 5.

✧

RAG Note: Reranking Is a High-ROI Addition

Adding a reranker is one of the most reliable ways to improve a RAG system's quality. Fast retrieval optimizes for RECALL (don't miss relevant chunks) by casting a wide net; the reranker optimizes for PRECISION (put the BEST chunks first) over that smaller set. Because the model only sees the top few reranked chunks, getting the ordering right directly improves answer quality. The extra latency is modest since reranking runs over dozens, not millions, of chunks.

The general pattern — a cheap, high-recall first stage followed by an expensive, high-precision second stage — recurs throughout retrieval and search systems. It lets you combine the scalability of fast methods with the accuracy of slow ones.
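
The two-stage pattern can be illustrated without any model at all. In the sketch below both scorers are deliberately simple stand-ins: word overlap plays the cheap first stage and an exact-phrase check plays the expensive second stage. The corpus, function names, and the `calls` counter are assumptions made for illustration.

```python
def two_stage_search(query, corpus, cheap_score, precise_score,
                     k_retrieve=3, k_final=1):
    """Shortlist with a cheap scorer, then reorder the shortlist precisely."""
    # Stage 1: the cheap scorer sees every document
    shortlist = sorted(corpus, key=lambda doc: -cheap_score(query, doc))[:k_retrieve]
    # Stage 2: the expensive scorer sees only the shortlist
    reranked = sorted(shortlist, key=lambda doc: -precise_score(query, doc))
    return reranked[:k_final]

def word_overlap(query, doc):
    """Cheap stand-in for fast retrieval: count of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

calls = []  # records every document the expensive scorer was asked about

def phrase_match(query, doc):
    """Stand-in for a slow, precise reranker: exact phrase present or not."""
    calls.append(doc)
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    'leave the parental controls on',
    'our parental leave policy grants 16 weeks',
    'office hours are 9-5',
    'parking policy for visitors',
    'how to leave feedback',
]

best = two_stage_search('parental leave', corpus, word_overlap, phrase_match)
print(best)
print(len(calls), 'of', len(corpus), 'documents reached the expensive scorer')
```

Word overlap cannot tell the first two documents apart, since both contain 'parental' and 'leave'. The second stage promotes the one that actually discusses parental leave, and it does so while scoring only the three shortlisted documents.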

Once the best chunks are retrieved and reranked, they must be assembled into the prompt — 'context stuffing'. This step seems trivial but has real subtleties: how much to include, in what order, and how to instruct the model to use it. Done poorly, even perfect retrieval yields bad answers.

Building the Prompt

Python•Assembling retrieved context into a prompt
def build_rag_prompt(query, chunks):
    """Stuff retrieved chunks into a grounded-answer prompt."""
    context = '\n\n'.join(
        f'[Source {i+1}] {chunk}' for i, chunk in enumerate(chunks)
    )
    return f"""Answer the question using ONLY the sources below.
If the answer isn't in the sources, say you don't know.
Cite sources by number.

Sources:
{context}

Question: {query}
Answer:"""

# Key instructions: ground in the sources, admit ignorance, cite.
# These instructions are what make RAG reduce hallucination.

Instructions That Reduce Hallucination

The prompt instructions matter as much as the retrieved text. Three instructions are crucial: (1) 'answer using ONLY the sources' — grounds the model in the retrieved text; (2) 'if the answer isn't in the sources, say you don't know' — prevents the model from filling gaps with hallucination; (3) 'cite sources by number' — makes answers verifiable and traceable. These instructions are what turn retrieved text into trustworthy, grounded answers.
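
Citations are only useful if something checks them. The sketch below is one possible post-generation check: it extracts cited source numbers from an answer and flags any that do not correspond to a supplied source. It assumes the model cites as `[Source 2]` or `[2]`; that format, the function name, and the returned fields are assumptions to adapt to your own prompt.

```python
import re

CITATION = re.compile(r'\[(?:Source\s+)?(\d+)\]')

def check_citations(answer, n_sources):
    """Report which source numbers an answer cites and which are invalid."""
    cited = {int(n) for n in CITATION.findall(answer)}
    invalid = sorted(n for n in cited if not 1 <= n <= n_sources)
    return {
        'cited': sorted(cited),
        'invalid': invalid,        # numbers that match no supplied source
        'uncited': not cited,      # True if the answer cites nothing at all
    }

report = check_citations(
    'Employees get 16 weeks of paid leave [Source 1]. Details are in [4].',
    n_sources=3,
)
print(report)
```

A check like this catches only structural problems: a citation to a source that was never provided, or an answer with no citations. Whether the cited source really supports the claim is a faithfulness question that needs a separate evaluation.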

The Lost-in-the-Middle Problem

A surprising and important finding (Liu et al., 2023): models pay MORE attention to information at the BEGINNING and END of the context than in the MIDDLE. If you stuff ten chunks and the crucial one lands in the middle, the model may overlook it — a phenomenon called 'lost in the middle'. This is why reranking matters beyond just filtering: placing the most relevant chunks at the start (and end) of the context, not buried in the middle, improves the model's use of them.
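
One simple way to act on this finding is to reorder reranked chunks so the strongest sit at the edges of the context and the weakest in the middle. The sketch below assumes its input is already sorted best-first; the function name and the alternating scheme are one illustrative choice among several.

```python
def order_for_context(ranked_chunks):
    """Place the strongest chunks at the start and end, weakest in the middle.

    ranked_chunks must be sorted best-first (for example, reranker output).
    """
    front, back = [], []
    for position, chunk in enumerate(ranked_chunks):
        # Alternate: 1st, 3rd, 5th ... go to the front; 2nd, 4th ... to the back
        (front if position % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ['best', 'second', 'third', 'fourth', 'fifth']
print(order_for_context(ranked))
```

Here the best chunk opens the context, the second best closes it, and the lowest-ranked chunk lands in the middle, where an overlooked passage costs the least.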

⚠️

More Context Is Not Always Better

A common beginner mistake is stuffing as many chunks as possible into the context, reasoning that more information helps. It often hurts: irrelevant chunks DISTRACT the model, the crucial chunk gets lost in the middle, and you waste context window and money. Quality and ORDERING of context beat quantity. Retrieve widely, rerank precisely, and include only the few BEST chunks, placed where the model will attend to them.

This connects to the long-context limits discussed in Chapter 33: even models with huge context windows do not use all of it equally well. Feeding a model a focused, well-ordered handful of relevant passages reliably beats dumping a hundred mediocre ones. Less, but better, is the rule for RAG context.
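
A token budget is a straightforward way to enforce 'less, but better'. The sketch below walks the ranked chunks best-first and stops at the first one that does not fit. It assumes a word count is an acceptable stand-in for a real tokenizer; pass your own `count_tokens` function for accurate budgets. All names here are illustrative.

```python
def select_within_budget(ranked_chunks, max_tokens=1000, count_tokens=None):
    """Keep the best-ranked chunks that fit within a token budget."""
    if count_tokens is None:
        # Crude proxy: one token per whitespace-separated word
        count_tokens = lambda text: len(text.split())
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break   # stop here so a weaker chunk never replaces a stronger one
        selected.append(chunk)
        used += cost
    return selected

ranked = ['a b c', 'd e', 'f g h i']   # toy chunks, best first
print(select_within_budget(ranked, max_tokens=5))
```

Stopping at the first chunk that does not fit, instead of skipping it and trying smaller ones further down, keeps the selection strictly in rank order. That choice trades a little unused budget for never promoting a weaker chunk over a stronger one.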

RAG systems fail in characteristic ways, and — as stressed in Section 29.2 — most failures are in the RETRIEVAL half, not the generation. Knowing the failure modes and their fixes makes debugging systematic.

Failure mode	What happens	Fix
Wrong chunks retrieved	Irrelevant context	Better embeddings, hybrid, rerank
Relevant chunk missed	Answer not in context	Higher k, better chunking, hybrid
Answer split across chunks	No single chunk has it	Larger chunks, overlap, merging
Lost in the middle	Crucial chunk ignored	Rerank; best chunks first/last
Hallucinates anyway	Ignores or overrides context	Stronger grounding instructions
Outdated index	Stale answers	Re-index when documents change
No relevant docs exist	Should say 'I don't know'	'Say I don't know' instruction

Debugging RAG: Check Retrieval First

When a RAG answer is wrong, the systematic first step is to INSPECT WHAT WAS RETRIEVED. Print the chunks that went into the prompt. Usually one of two things is true: either the right chunk was NOT retrieved (a retrieval problem — fix chunking, embeddings, k, or add hybrid/reranking), or the right chunk WAS retrieved but the model didn't use it (a generation problem — fix the prompt instructions or chunk ordering). Distinguishing these two cases tells you exactly where to focus.
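
The two-way split described above can be turned into a small triage helper. This sketch assumes you know a short string that a correct answer must contain (for example, from a hand-labeled test question) and uses case-insensitive substring matching, which is crude but often enough for a first pass. The function name and return labels are illustrative.

```python
def diagnose(retrieved_chunks, answer, expected_fact):
    """Classify a RAG result as ok, a retrieval problem, or a generation problem."""
    fact = expected_fact.lower()
    in_context = any(fact in chunk.lower() for chunk in retrieved_chunks)
    in_answer = fact in answer.lower()
    if in_answer:
        return 'ok'
    if not in_context:
        return 'retrieval problem'    # the right text never reached the prompt
    return 'generation problem'       # the text was there but the model missed it

chunks = ['Employees get 16 weeks of paid parental leave.', 'Office hours are 9-5.']

print(diagnose(chunks, 'You get 16 weeks.', '16 weeks'))
print(diagnose(chunks, 'You get 12 weeks.', '16 weeks'))
print(diagnose(['Office hours are 9-5.'], 'You get 12 weeks.', '16 weeks'))
```

Running this over a batch of failed questions gives a quick tally of where the failures cluster, which tells you whether to work on chunking and retrieval or on the prompt and chunk ordering.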

✧

RAG Note: Evaluate Retrieval and Generation Separately

Because RAG has two halves, evaluate them separately. For RETRIEVAL, measure whether the relevant chunks were found (recall@k: was the right chunk in the top k?). For GENERATION, measure whether the answer is faithful to the retrieved context (does it only claim what the sources support?) and whether it actually answers the question. A system can have great generation but poor retrieval, or vice versa — separate metrics pinpoint which to fix.

Tools and frameworks (RAGAS and others) automate these RAG-specific metrics: context relevance, answer faithfulness, and answer relevance. Measuring the two halves independently is the key discipline for improving a RAG system methodically rather than by guesswork.
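
Retrieval recall is simple enough to compute by hand. The sketch below assumes a small labeled evaluation set in which each question lists the ids of the chunks that answer it. The chunk ids, the `eval_set` layout, and the stub retriever are all hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError('each question needs at least one relevant chunk')
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)

def mean_recall_at_k(eval_set, retrieve, k):
    """Average recall@k over a labeled set of questions."""
    scores = [
        recall_at_k(retrieve(item['question']), item['relevant'], k)
        for item in eval_set
    ]
    return sum(scores) / len(scores)

# Hypothetical labeled questions and a stub retriever returning chunk ids.
eval_set = [
    {'question': 'parental leave?', 'relevant': {'c2', 'c4'}},
    {'question': 'office hours?', 'relevant': {'c9'}},
]
STUB_RESULTS = {
    'parental leave?': ['c7', 'c2', 'c9'],
    'office hours?': ['c9', 'c1', 'c3'],
}

def stub_retrieve(question):
    return STUB_RESULTS[question]

print(mean_recall_at_k(eval_set, stub_retrieve, k=3))
```

The first question finds one of its two relevant chunks and the second finds its only one, so the mean recall@3 is 0.75. Tracking this number while changing chunk size, embedding model, or fusion settings shows whether a retrieval change helped before any generation is involved.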

The basic retrieve-then-generate pipeline can be extended in many ways to handle harder cases. These advanced patterns address specific weaknesses of vanilla RAG and are increasingly common in production systems.

Pattern	What it adds
Query rewriting	Rephrase/expand the query before retrieval for better matches
Multi-query	Generate several query variations, retrieve for each, merge
HyDE	Generate a hypothetical answer, embed IT, and retrieve with that
Agentic RAG	The model decides when and what to retrieve, via tool calls
Self-RAG	The model critiques whether retrieval is needed and if results suffice
GraphRAG	Build a knowledge graph; retrieve over entities and relationships
Contextual retrieval	Prepend document context to each chunk before embedding

Query Rewriting and Expansion

Users phrase questions in ways that don't match how documents are written. Query rewriting uses the model to rephrase or expand the query before retrieval — turning a terse 'parental leave?' into 'What is the company parental leave policy, including duration and eligibility?' — which retrieves better. Multi-query retrieval generates several phrasings and merges their results, improving recall.
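
Multi-query retrieval can be sketched as a loop over query variants whose results are fused by rank. In the sketch below, `rewrite` stands in for a model call that produces rephrasings and `retrieve` stands in for any retriever returning chunk ids best-first; both are stubbed with hard-coded data, and the fusion constant 60 is an assumption.

```python
def multi_query_retrieve(query, rewrite, retrieve, k=3):
    """Retrieve for the query and its rewrites, then fuse the rankings."""
    variants = [query] + rewrite(query)
    fused = {}
    for variant in variants:
        for rank, chunk_id in enumerate(retrieve(variant, k), start=1):
            # Reciprocal-rank contribution: chunks found by several variants add up
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]

def stub_rewrite(query):
    """Stand-in for an LLM that rephrases and expands the query."""
    return [
        'What is the company parental leave policy?',
        'maternity and paternity time off',
    ]

STUB_INDEX = {
    'parental leave?': ['c3', 'c8'],
    'What is the company parental leave policy?': ['c1', 'c3'],
    'maternity and paternity time off': ['c1', 'c5', 'c8'],
}

def stub_retrieve(query, k):
    """Stand-in retriever: looks up a hard-coded ranking of chunk ids."""
    return STUB_INDEX.get(query, [])[:k]

print(multi_query_retrieve('parental leave?', stub_rewrite, stub_retrieve))
```

Chunk `c1` is never returned for the terse original query, yet it tops the fused ranking because both rewrites rank it first. Surfacing chunks like that is the recall gain multi-query retrieval is after.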

Agentic RAG: Retrieval as a Tool

The most important modern pattern connects directly to Chapter 28: treat retrieval as a TOOL the model can call. Instead of always retrieving once at the start, the model DECIDES when it needs to look something up, formulates the query itself, and can retrieve multiple times as it works through a problem (the ReAct loop). This 'agentic RAG' handles complex, multi-step questions that a single retrieval cannot — the model retrieves, reasons, retrieves again, and synthesizes.

Tool Trace: Agentic RAG: the model retrieves as needed

User	Compare our 2023 and 2024 revenue and explain the change.	→
Model	I need both years' figures — retrieve(2023 revenue report)	→
Retriever	Returns the 2023 financials chunk	←
Model	Now retrieve(2024 revenue report)	→
Retriever	Returns the 2024 financials chunk	←
Model	Has both — synthesizes the comparison and explanation	←
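
The trace above can be sketched as a loop in which the model repeatedly chooses between retrieving and answering. To keep the example runnable, `scripted_model` is a hard-coded stand-in for a real model's decision, the revenue figures are toy values, and the action format (`type`, `query`, `text`) is an assumption of this sketch rather than any particular tool-calling API.

```python
def agentic_rag(question, model_step, retrieve, max_steps=5):
    """Let the model retrieve as often as it likes, up to a step budget."""
    notes = []   # passages gathered so far, passed back to the model each step
    for _ in range(max_steps):
        action = model_step(question, notes)
        if action['type'] == 'answer':
            return action['text'], notes
        notes.append(retrieve(action['query']))
    # Budget exhausted without an answer: fail safely instead of looping forever
    return 'I could not find enough information.', notes

def scripted_model(question, notes):
    """Stand-in for an LLM deciding its next action (hard-coded policy)."""
    if len(notes) == 0:
        return {'type': 'retrieve', 'query': '2023 revenue report'}
    if len(notes) == 1:
        return {'type': 'retrieve', 'query': '2024 revenue report'}
    return {'type': 'answer', 'text': 'Compared using: ' + ' | '.join(notes)}

TOY_DOCS = {
    '2023 revenue report': '2023 revenue: 10 units (toy figure)',
    '2024 revenue report': '2024 revenue: 12 units (toy figure)',
}

def toy_retrieve(query):
    return TOY_DOCS.get(query, 'no matching document')

answer, notes = agentic_rag(
    'Compare our 2023 and 2024 revenue and explain the change.',
    scripted_model, toy_retrieve,
)
print(answer)
```

The `max_steps` cap matters in practice. A model that keeps deciding to retrieve would otherwise loop indefinitely, so the loop returns a fallback answer once the budget is spent.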

✧

RAG Note: RAG and Tool Calling Are Converging

Notice how agentic RAG IS tool calling (Chapter 28) with a retrieval tool. The line between 'RAG' and 'an agent that can search' is blurring: modern systems give the model a retrieval tool and let it decide when and what to retrieve, often multiple times, interleaved with reasoning. This is more flexible than a fixed retrieve-once pipeline and handles complex questions better.

The unifying view: retrieval is one of the most important tools you can give a model. Basic RAG is 'always retrieve once at the start'; agentic RAG is 'retrieve whenever the model judges it useful'. As models get better at this judgment, agentic retrieval increasingly subsumes the fixed pipeline.

RAG is one of three ways to get a model to use specific knowledge. The others are FINE-TUNING (bake the knowledge into the weights, Chapter 22) and LONG CONTEXT (just put everything in the prompt). Each has its place, and knowing when to use which is an important practical judgment.

Approach	How knowledge enters	Best when
RAG	Retrieved into the prompt per query	Large, changing knowledge base
Fine-tuning	Baked into the weights	Teaching style/format/skills
Long context	All stuffed into the prompt	Small, fixed knowledge that fits

Why RAG Often Wins for Knowledge

For injecting KNOWLEDGE (facts, documents, data), RAG has decisive advantages over fine-tuning: it updates instantly (just change the index — no retraining), scales to far more knowledge than fits in weights or context, provides citations, and keeps knowledge separate from the model so it's auditable and current. Fine-tuning is better for teaching SKILLS, STYLE, or FORMAT — how to behave — rather than facts to recall.

✧

Compare: RAG vs Fine-Tuning: Knowledge vs Skill

Use RAG to give the model KNOWLEDGE it should look up: your documents, current facts, large or changing information. Knowledge stays external, updatable, and citable.

Use fine-tuning to give the model a SKILL or STYLE it should internalize: a domain's way of writing, a specific output format, a behaviour. These belong in the weights. The two are complementary — fine-tune for how to behave, RAG for what to know.

They Combine

RAG and fine-tuning are not mutually exclusive — the strongest systems often use both: fine-tune the model for the domain's style and the skill of using retrieved context well, AND use RAG to supply current, specific knowledge at query time. And as context windows grow (Chapter 33), the line between RAG and long context shifts — but even with huge contexts, retrieval remains valuable for selecting WHAT to put in the context from a knowledge base far too large to fit entirely.

Let us assemble the whole chapter into a complete, production-minded RAG system, integrating chunking, embedding, indexing, hybrid retrieval, reranking, context assembly, and grounded generation.

Pipeline Flow: A complete RAG system

1	Ingest & chunk	Split documents on natural boundaries, with overlap
2	Embed & index	Embed chunks; store in a vector DB with metadata
3	Hybrid retrieve	Dense + keyword search, fused with RRF
4	Rerank	Cross-encoder picks the best few from the candidates
5	Assemble	Best chunks first/last; grounding + citation instructions
6	Generate	Model answers grounded in context, cites sources
7	Evaluate	Measure retrieval recall and answer faithfulness separately
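
Step 3 depends on reciprocal rank fusion, so a small sketch may help make it concrete. This is an illustration under stated assumptions, not the chapter's reference implementation: each retriever is assumed to return chunk ids ordered best-first, and the constant k=60 is a commonly used default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked lists of chunk ids (each ordered best-first).

    Each list contributes 1 / (k + rank) to a chunk's fused score, so only
    rank positions matter and the retrievers' raw scores never need to be
    put on a common scale. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from a dense and a keyword retriever.
dense = ["c3", "c1", "c7"]
keyword = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion(dense, keyword)
print(fused)
```

Chunks that appear in both lists (c1 and c3 here) rise above chunks that only one retriever found, which is the behaviour hybrid search is after.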

Python•A complete RAG system (bringing it together)
class RAGSystem:
    # Assumes the helpers used here (chunk_with_overlap, build_faiss_index,
    # build_bm25, faiss_search, reciprocal_rank_fusion, rerank, build_rag_prompt)
    # and the `embedder` and `model` objects are already defined.
    def __init__(self, documents):
        # OFFLINE: chunk, embed, index
        self.chunks = [c for d in documents for c in chunk_with_overlap(d)]
        self.vecs = embedder.encode(self.chunks, normalize_embeddings=True)
        self.index = build_faiss_index(self.vecs)
        self.bm25  = build_bm25(self.chunks)        # for hybrid search

    def answer(self, query):
        # 1. Hybrid retrieve (dense + keyword), fuse with RRF.
        #    Normalize the query vector the same way as the chunk vectors,
        #    otherwise inner-product scores are not cosine similarities.
        q_vec   = embedder.encode(query, normalize_embeddings=True)
        dense   = faiss_search(self.index, q_vec, k=30)
        keyword = self.bm25.search(query, k=30)
        candidates = reciprocal_rank_fusion(dense, keyword)[:30]

        # 2. Rerank to the best few
        top = rerank(query, candidates, k_final=5)

        # 3. Assemble grounded prompt and generate
        prompt = build_rag_prompt(query, top)    # cite + 'say I don't know'
        return model.generate(prompt)

# Each stage (chunk, embed, hybrid, rerank, ground) adds robustness.
# Frameworks (LlamaIndex, LangChain) provide this plumbing prebuilt.
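
The listing calls build_rag_prompt without showing it. The sketch below is one possible version, offered as an assumption rather than the chapter's definition: chunks are taken to be (source_id, text) pairs ordered best-first by the reranker, the helper order_for_long_context is a hypothetical name, and the instruction wording is illustrative. It places the strongest chunks at the start and end of the context (the lost-in-the-middle mitigation from step 5) and adds grounding and citation instructions.

```python
def order_for_long_context(chunks):
    """Reorder best-first chunks so the strongest sit at the edges.

    Even-ranked chunks fill the front, odd-ranked chunks fill the back in
    reverse, which leaves the weakest chunks in the middle of the context.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


def build_rag_prompt(query, chunks):
    """Build a grounded prompt from (source_id, text) pairs, best-first."""
    ordered = order_for_long_context(chunks)
    context = "\n\n".join(f"[{sid}] {text}" for sid, text in ordered)
    return (
        "Answer the question using only the sources below. "
        "Cite the source id in brackets after each claim. "
        "If the sources do not contain the answer, say \"I don't know\".\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical reranked chunks, best first.
top = [
    ("doc1", "RRF fuses ranked lists using rank positions."),
    ("doc2", "BM25 is a keyword scoring function."),
]
print(build_rag_prompt("What does RRF do?", top))
```

Keeping the source id next to each chunk is what makes the citation instruction checkable: an answer that cites an id not present in the context is an immediate red flag.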

✧

RAG Note: Start Simple, Add Stages as Needed

You do not need every stage on day one. Start with the simplest RAG that works: chunk, embed, retrieve top-k, stuff into a grounded prompt. Measure where it fails. Add hybrid search if exact terms are missed; add reranking if the right chunks are retrieved but ranked poorly; add query rewriting if user phrasing is the problem; go agentic if questions need multiple retrievals. Each stage targets a specific failure — add it when the data shows you need it.

This mirrors the engineering discipline from Part V: start simple, measure, and add complexity only where it earns its place. A well-tuned basic RAG often beats an elaborate one built without measurement.
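
Measuring retrieval on its own is usually the cheapest place to start. The sketch below computes recall@k over a small labelled set; it assumes you have, for each question, the ids of the chunks that contain the answer. The function names and the toy data are hypothetical, and answer faithfulness still needs a separate check (human or model-based) that is not shown here.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant chunk id")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(eval_set, retrieve, k):
    """Average recall@k over (query, relevant_ids) pairs.

    `retrieve` is any callable mapping a query to a best-first list of ids.
    """
    scores = [recall_at_k(retrieve(q), gold, k) for q, gold in eval_set]
    return sum(scores) / len(scores)


# Toy evaluation set and a stand-in retriever backed by fixed results.
eval_set = [("q1", ["c1"]), ("q2", ["c4", "c5"])]
fake_results = {"q1": ["c2", "c1", "c3"], "q2": ["c4", "c9", "c8"]}
retrieve = fake_results.get

r1 = mean_recall_at_k(eval_set, retrieve, k=1)
r3 = mean_recall_at_k(eval_set, retrieve, k=3)
print(r1, r3)
```

If recall@k is low, the fix belongs in chunking or retrieval; if recall@k is high but answers are still wrong, the problem is more likely in ranking, context assembly, or generation.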

RAG Quick-Reference

Concept	Key idea	Remember
Why RAG	Ground answers in external knowledge	Open-book exam for models
The pipeline	Retrieve, then generate	Retrieval quality dominates
Dense retrieval	Match by meaning via embeddings	Finds synonyms/paraphrases
Vector DB / FAISS	Fast approximate nearest-neighbor	Scales to millions of vectors
Chunking	Split documents into passages	Overlap; natural boundaries
Hybrid search	Dense + keyword, fused (RRF)	Best of meaning + exactness
Reranking	Precise second pass on candidates	Retrieve wide, rerank narrow
Context assembly	Stuff best chunks, instruct to ground	Beware lost-in-the-middle
RAG vs fine-tune	Knowledge vs skill	RAG updates instantly, cites

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–22 require code.

✎

Exercise 1: Pen & Paper

List the problems RAG solves that a model alone cannot. Explain the open-book-exam analogy and why retrieval quality dominates.

✎

Exercise 2: Pen & Paper

Describe the offline and online phases of the RAG pipeline. What happens in each, and why is retrieval the half to suspect first when answers are wrong?

✎

Exercise 3: Pen & Paper

Explain dense retrieval. Why can it find relevant text that shares no words with the query, and what does cosine similarity measure?

✎

Exercise 4: Pen & Paper

Contrast dense and sparse (keyword) retrieval. Give a query where each wins and one where you'd want both.

✎

Exercise 5: Pen & Paper

Why is exact (brute-force) vector search too slow at scale? Explain approximate nearest neighbor and the recall/speed trade-off.

✎

Exercise 6: Pen & Paper

Explain the chunking trade-off (too small vs too large). Why does overlap help, and why might natural-boundary chunking beat fixed-size?

✎

Exercise 7: Pen & Paper

Explain hybrid search and reciprocal rank fusion. Why does RRF use ranks rather than raw scores?

✎

Exercise 8: Pen & Paper

Explain reranking. Why is a cross-encoder more accurate than a bi-encoder, and why is it applied only to the top candidates?

✎

Exercise 9: Pen & Paper

Describe the lost-in-the-middle problem and two ways to mitigate it. Why is 'more context' not always better?

✎

Exercise 10: Pen & Paper

Compare RAG, fine-tuning, and long context for injecting knowledge. When is each best, and how do RAG and fine-tuning combine?

✎

Exercise 11: Code

Implement dense retrieval from scratch: embed a small set of chunks, embed a query, and return the top-k by cosine similarity.

✎

Exercise 12: Code

Build a FAISS index over chunk embeddings and retrieve the top-k for a query. Compare a Flat (exact) index to an HNSW (approximate) index on speed and recall.

✎

Exercise 13: Code

Implement three chunking strategies (fixed-size, fixed-with-overlap, sentence-based) and compare retrieval quality on a small document set.

✎

Exercise 14: Code

Implement BM25 keyword retrieval and combine it with dense retrieval using reciprocal rank fusion. Show a query where hybrid beats either alone.

✎

Exercise 15: Code

Add a cross-encoder reranker: retrieve 50 candidates, rerank to the top 5, and compare the final 5 to retrieval's raw top 5 on relevance.

✎

Exercise 16: Code

Build a RAG prompt with grounding and citation instructions. Demonstrate that the 'say I don't know' instruction prevents hallucination when no relevant chunk exists.

✎

Exercise 17: Code

Demonstrate the lost-in-the-middle effect: place the answer chunk at the start, middle, and end of a long context and measure whether the model uses it.

✎

Exercise 18: Code Lab

Build a complete basic RAG system (chunk, embed, index, retrieve, ground, generate) over a set of documents. Answer questions and inspect the retrieved chunks.

✎

Exercise 19: Code Lab

Implement RAG evaluation: measure retrieval recall@k (was the right chunk retrieved?) and answer faithfulness (does the answer only claim what the sources support?) separately.

✎

Exercise 20: Code

Implement query rewriting: use the model to expand a terse query before retrieval, and show it improves recall on under-specified questions.

✎

Exercise 21: Code Lab

Build agentic RAG: give the model a retrieval tool (Chapter 28) and let it decide when and what to retrieve. Test it on a multi-hop question needing two retrievals.

✎

Exercise 22: Code (Challenge)

Build a full production-style RAG system with overlapping natural-boundary chunking, hybrid (dense + BM25) retrieval fused with RRF, cross-encoder reranking, lost-in-the-middle-aware context ordering, and grounded generation with citations. Then build an evaluation set, measure retrieval recall and answer faithfulness, deliberately degrade each stage (bad chunking, no reranking, no hybrid) and quantify how much each stage contributed to the final quality.

Further reading: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020), the original RAG paper. “Dense Passage Retrieval” (Karpukhin et al., 2020). “Billion-scale similarity search with GPUs” and the FAISS library (Johnson et al., 2017). “HNSW” (Malkov & Yashunin, 2018) for graph-based ANN. “Lost in the Middle” (Liu et al., 2023). “Precise Zero-Shot Dense Retrieval (HyDE)” (Gao et al., 2022). “Self-RAG” (Asai et al., 2023). “Contextual Retrieval” (Anthropic, 2024). The RAGAS framework for RAG evaluation.

Next → Chapter 30: Multi-modal LLMs

So far our models work entirely in text. But the world is not only text — it is images, audio, and video. Chapter 30 extends LLMs to MULTIPLE MODALITIES: models that can SEE images and HEAR audio, not just read text. We will see how images are turned into tokens the model can process (vision encoders and projection), how text and visual information are fused, how these models are trained, and what they can do — from describing images to answering questions about charts and documents. The text-only assistant becomes one that perceives the world more like we do.

