Part VI: Productionization
Chapter 29

Retrieval-Augmented Generation

RAG, vector databases, and knowledge grounding
22 Exercises
29.1

In Chapter 28 the model gained the ability to call tools. One of the most valuable things a tool can fetch is KNOWLEDGE: facts, documents, and data the model was never trained on or cannot reliably recall. Retrieval-Augmented Generation (RAG) is the technique for grounding a model's answers in an external knowledge source: retrieve the relevant text, put it in the prompt, and let the model answer based on it. RAG is one of the most widely used LLM techniques in production, and this chapter builds it from the ground up.

The Problems RAG Solves

Problem | Without RAG | With RAG
Stale knowledge | Frozen at training cutoff | Retrieve current documents
No private data | Never saw your files | Retrieve from your knowledge base
Hallucination | Invents plausible facts | Grounds answers in real text
No sources | Can't cite anything | Cites the retrieved documents
Limited memory | Can't hold all knowledge | Stores knowledge externally

The Core Idea: Look It Up, Then Answer

RAG mirrors how a careful person answers a hard question: rather than answering from memory alone, they LOOK IT UP in a reliable source, then answer based on what they found. RAG gives a model the same workflow. Given a question, the system first RETRIEVES the most relevant pieces of text from a knowledge base, inserts them into the prompt, and asks the model to answer USING that retrieved context. The model's broad language ability is combined with specific, up-to-date, trustworthy information.

Retrieval-Augmented Generation (RAG)
A technique that improves a model's answers by retrieving relevant text from an external knowledge source and providing it in the prompt, so the model generates answers grounded in that retrieved information rather than from its weights alone.
RAG Note: RAG = Open-Book Exam
The cleanest analogy: a model without RAG is taking a CLOSED-BOOK exam — it must answer from memory, and it may misremember or make things up. RAG turns it into an OPEN-BOOK exam — the model can consult the relevant pages before answering. Just as open-book exams produce more accurate, citable answers, RAG produces answers grounded in real, current, source-able text.
And like a real open-book exam, the quality depends on finding the RIGHT pages quickly. A student who can't locate the relevant passage does poorly even with the book open. Most of this chapter is about the retrieval half — finding the right text — because that is what makes or breaks a RAG system.
29.2

RAG has two phases. An OFFLINE phase prepares the knowledge base (done ahead of time, and repeated when the documents change), and an ONLINE phase answers each query (done per question). Understanding the two phases and how they fit together is the foundation for everything else.

Offline: Building the Index

Pipeline Flow: Offline phase: prepare the knowledge base (done once)

1. Collect: Gather your documents (files, web pages, database records)
2. Chunk: Split documents into smaller passages (Section 29.5)
3. Embed: Convert each chunk into a vector capturing its meaning (Section 29.3)
4. Index: Store the vectors in a vector database for fast search (Section 29.4)

Online: Answering a Query

When a question arrives, the system retrieves relevant chunks and generates a grounded answer. This is the retrieve-then-generate flow — the heart of RAG:

Tool Trace: Online phase: the retrieve-then-generate flow

User: What is our company's parental leave policy?
Embed: Convert the question into a query vector
Retriever: Find the most similar chunks in the vector DB
Vector DB: Returns top-k relevant passages from the HR handbook
App: Stuffs the retrieved passages into the prompt as context
Model: Generates an answer grounded in the retrieved policy text
User: 'Employees get 16 weeks of paid leave... [grounded in the handbook]'

Text: The RAG pipeline (Pseudocode)
# OFFLINE (once): build the index
for each document: chunk it, embed each chunk, store vectors in the index

# ONLINE (per query): retrieve then generate
1. embed the user's query into a vector
2. retrieve the top-k most similar chunks from the index
3. (optional) rerank the chunks for relevance
4. build a prompt: question + retrieved chunks as context
5. the model generates an answer grounded in that context
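The pseudocode above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the "embedding" is a plain bag-of-words count vector standing in for a real embedding model, each document is treated as a single chunk, the sample documents are made up, and the function names are hypothetical. Generation (step 5) is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """OFFLINE: embed every chunk once (here, one chunk per document)."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index, query, k=2):
    """ONLINE steps 1-2: embed the query, return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, chunks):
    """ONLINE step 4: question plus retrieved chunks as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up sample documents
docs = [
    "Parental leave is available to all full-time employees.",
    "The office is open from 9 to 5 on weekdays.",
    "Expense reports are due by the fifth of each month.",
]
index = build_index(docs)
top = retrieve(index, "Who can take parental leave?", k=1)
prompt = build_prompt("Who can take parental leave?", top)
```

Because the toy embedding only counts shared words, it behaves like keyword matching; swapping in a real embedding model is what gives the semantic matching described in Section 29.3.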
RAG Note: Two Halves: Retrieval and Generation
RAG has a RETRIEVAL half (find the right text) and a GENERATION half (write a good answer from it). Beginners often focus on the generation — the model — but in practice, RETRIEVAL quality dominates RAG success. If retrieval surfaces the wrong passages, even the best model produces a wrong or ungrounded answer ('garbage in, garbage out'). Most of the engineering effort, and most of this chapter, is on retrieval.
A useful rule of thumb: if your RAG system gives bad answers, suspect retrieval first. Check whether the right passages were actually retrieved before blaming the model. More often than not, the model answered correctly given what it was handed — the problem was that it was handed the wrong context.
29.3

The heart of modern retrieval is the EMBEDDING — a vector that captures the MEANING of a piece of text, building directly on the embeddings of Chapter 8. Dense retrieval uses embeddings to find text by SEMANTIC similarity: passages that MEAN the same thing as the query, even if they share no words. This is what lets RAG find 'parental leave' content when you ask about 'maternity time off'.

From Text to Vectors

An embedding model converts any piece of text into a fixed-length vector — a list of numbers — positioned so that texts with SIMILAR MEANING have NEARBY vectors. 'How do I reset my password?' and 'I forgot my login credentials' land close together in the vector space, despite sharing almost no words, because they mean nearly the same thing. This semantic matching is the superpower of dense retrieval over keyword search.

Shape Trace: Embedding text for retrieval

Operation | Shape | Note
query text | "reset password" | raw string
embedding model | | encode
query vector | (384,) | captures meaning
compare to chunk vectors | (N, 384) | cosine similarity
top-k nearest | (k, 384) | most similar chunks

Measuring Similarity

To find the chunks most similar to the query, we compare their vectors. The standard measure is COSINE SIMILARITY — the cosine of the angle between two vectors (Chapter 1) — which is high when vectors point in the same direction (similar meaning) and low when they don't. Retrieval finds the k chunks whose vectors are most similar to the query vector.

Text: Cosine similarity for retrieval
similarity(q, c) = (q · c) / (||q|| · ||c||)

q = query vector,  c = chunk vector
= 1.0  when meanings are identical (vectors aligned)
= 0.0  when unrelated (vectors orthogonal)

# Retrieve the k chunks with the HIGHEST similarity to the query.
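A small worked example may help make the formula concrete. This is a sketch in plain Python with hand-picked 2-dimensional vectors (real embeddings have hundreds of dimensions); the function name is illustrative. It also shows a case the summary above leaves out: vectors pointing in opposite directions score -1.

```python
import math


def cosine_similarity(q, c):
    """(q . c) / (||q|| * ||c||) for two equal-length vectors."""
    dot = sum(qi * ci for qi, ci in zip(q, c))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_c = math.sqrt(sum(ci * ci for ci in c))
    return dot / (norm_q * norm_c)


query = [1.0, 0.0]
aligned = [3.0, 0.0]      # same direction, different length
diagonal = [1.0, 1.0]     # 45 degrees away
orthogonal = [0.0, 2.0]   # 90 degrees away
opposite = [-2.0, 0.0]    # 180 degrees away

sims = {
    "aligned": cosine_similarity(query, aligned),        # 1.0
    "diagonal": cosine_similarity(query, diagonal),      # about 0.707
    "orthogonal": cosine_similarity(query, orthogonal),  # 0.0
    "opposite": cosine_similarity(query, opposite),      # -1.0
}
```

Note that the aligned vector scores 1.0 even though it is three times longer than the query: cosine similarity depends only on direction, not on length.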
Python: Dense retrieval from scratch
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')   # produces 384-dim vectors

# OFFLINE: embed all chunks once
chunks = [
    'Reset your password in Settings.',
    'Office hours are 9-5.',
    # ... more chunks
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # (N, 384)

def retrieve(query, k=3):
    """Return the k chunks most similar to the query, as (chunk, score) pairs."""
    q = embedder.encode(query, normalize_embeddings=True)  # (384,)
    # Cosine similarity = dot product (vectors are normalized)
    scores = chunk_vecs @ q                       # (N,)
    top = np.argsort(-scores)[:k]                 # k highest
    return [(chunks[i], scores[i]) for i in top]

retrieve('I forgot my login')  # -> should rank the password-reset chunk first,
                          #    despite sharing no words. Semantic match!
RAG Note: Dense vs Sparse Retrieval
Traditional 'sparse' retrieval (like BM25) matches on KEYWORDS: it finds documents containing the query's words. 'Dense' retrieval matches on MEANING via embeddings. Dense retrieval handles synonyms and paraphrases that keyword search misses ('car' vs 'automobile'), but keyword search is hard to beat for exact terms, names, codes, and rare words that embeddings may blur. Each has strengths, which is why hybrid search (Section 29.6) combines them.
The embedding model matters enormously: a better embedding model places semantically related texts closer together, directly improving retrieval. Choosing and sometimes fine-tuning the embedding model for your domain is one of the highest-leverage decisions in a RAG system.
29.4

Computing cosine similarity against every chunk works for a few thousand chunks, but production knowledge bases can hold millions or billions. Comparing the query to every single vector (a 'brute-force' search) then becomes too slow. VECTOR DATABASES solve this with specialized indexes that find the nearest vectors quickly. They are the infrastructure that makes RAG scale.

The Problem: Exact Search Is Too Slow

Brute-force search compares the query against all N chunk vectors — O(N) per query. For millions of vectors and many queries per second, this is prohibitively slow. We need a way to find the nearest vectors WITHOUT comparing against all of them. The trick is to accept APPROXIMATE answers: find vectors that are ALMOST certainly among the nearest, far faster than guaranteeing the exact nearest.

Approximate Nearest Neighbor (ANN)
A search that finds vectors very close to the query without exhaustively comparing against all of them, trading a small chance of missing the true nearest neighbor for dramatically faster search.

How ANN Indexes Work

ANN indexes organize vectors so that search can skip most of them. Two popular families: (1) HNSW (Hierarchical Navigable Small World) builds a layered graph in which search starts from a sparse top layer and hops greedily toward the query through progressively denser layers; (2) IVF (Inverted File) clusters vectors and searches only the clusters nearest the query. Both avoid the full O(N) scan, at the cost of occasionally missing a true nearest neighbor.

Index type | How it works | Trade-off
Flat (brute force) | Compare against every vector | Exact, but O(N) slow
IVF | Cluster, search nearest clusters only | Fast, may miss some
HNSW | Navigable graph, hop toward query | Fast, high recall, more memory
PQ (quantization) | Compress vectors to save memory | Smaller, slight accuracy loss
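To see how an IVF-style index trades recall for speed, here is a toy sketch in plain Python. It is not how FAISS implements IVF: the centroids are hand-picked rather than learned with k-means, the data is four made-up 2-dimensional vectors, and the class name ToyIVF and parameter nprobe (how many clusters to search) are chosen for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


class ToyIVF:
    """Toy inverted-file index: bucket vectors by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.buckets = [[] for _ in centroids]   # one list of (id, vec) per centroid

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dot(vec, self.centroids[i]), reverse=True)
        return order[:n]

    def add(self, vec_id, vec):
        vec = normalize(vec)
        bucket = self._nearest_centroids(vec, 1)[0]
        self.buckets[bucket].append((vec_id, vec))

    def search(self, query, k=1, nprobe=1):
        """Scan only the nprobe buckets whose centroids are closest to the query."""
        query = normalize(query)
        candidates = []
        for b in self._nearest_centroids(query, nprobe):
            candidates.extend(self.buckets[b])
        candidates.sort(key=lambda item: dot(query, item[1]), reverse=True)
        return [vec_id for vec_id, _ in candidates[:k]]


index = ToyIVF(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])     # lands in bucket 0
index.add("b", [0.8, 0.3])     # lands in bucket 0
index.add("c", [0.1, 0.9])     # lands in bucket 1
index.add("d", [0.6, 0.65])    # lands in bucket 1, close to the cluster border

query = [0.7, 0.6]
fast = index.search(query, k=1, nprobe=1)    # scans bucket 0 only
exact = index.search(query, k=1, nprobe=2)   # scans every bucket
```

With nprobe=1 the search returns "b", the best vector in the probed bucket, and misses "d", which is the true nearest neighbor but sits just across the cluster border. Raising nprobe recovers it at the cost of scanning more vectors, which is the trade-off in the table above.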

FAISS: A Vector Search Library

FAISS (Facebook AI Similarity Search) is a widely used library implementing these indexes. It is a LIBRARY (you embed it in your application), as opposed to vector DATABASES and database extensions (Pinecone, Weaviate, Qdrant, Milvus, pgvector) that add storage, metadata filtering, and operational features. For learning and many applications, FAISS is the workhorse; for production with persistence and scaling needs, a vector database often fits better.

Python: Vector search with FAISS
import faiss
import numpy as np

# OFFLINE: build the index from chunk vectors (N, dim); FAISS expects float32
chunk_vecs = np.asarray(chunk_vecs, dtype=np.float32)
dim = chunk_vecs.shape[1]                # e.g. 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatIP(dim)           # inner product = cosine (normalized)
index.add(chunk_vecs)                    # add all chunk vectors

# For millions of vectors, use an ANN index instead of Flat:
# index = faiss.IndexHNSWFlat(dim, 32)    # HNSW graph index, M=32 neighbors per node
# (HNSWFlat defaults to L2 distance; for unit vectors the ranking matches cosine.)

# ONLINE: search for the k nearest chunks to a query vector of shape (dim,)
scores, ids = index.search(query_vec[None], k=5)
results = [chunks[i] for i in ids[0] if i != -1]   # -1 marks an empty slot

# Flat is exact but O(N); HNSW/IVF are approximate but scale to millions.
# Vector DBs (Pinecone, Qdrant, pgvector) add persistence + filtering.
Scale Note: Metadata Filtering Matters
Real RAG systems rarely search ALL vectors — they filter by METADATA first. You might restrict the search to a specific user's documents, a date range, a department, or a document type, THEN do vector search within that subset. This 'filtered vector search' is essential for correctness (don't retrieve another user's data) and relevance (search only the right corpus). It is a key feature distinguishing production vector databases from a bare FAISS index.
Combining metadata filters with vector similarity is also a performance question: filtering first shrinks the search space, but must be done carefully so the ANN index still works efficiently. Production vector databases are largely about doing this well at scale.
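The filter-then-search idea can be sketched in a few lines of plain Python. This is an illustrative brute-force version, not how a production vector database implements filtering: the record layout, the function name filtered_search, and the sample data are all assumptions, and vectors are taken to be (approximately) unit length so the dot product stands in for cosine similarity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_search(records, query_vec, k, **required):
    """Pre-filter by exact metadata match, then rank the survivors by similarity.

    records:  list of dicts with 'text', 'vec' (unit-length), and 'meta'.
    required: metadata key/value pairs every result must match.
    """
    allowed = [r for r in records
               if all(r["meta"].get(key) == value for key, value in required.items())]
    allowed.sort(key=lambda r: dot(query_vec, r["vec"]), reverse=True)
    return [r["text"] for r in allowed[:k]]


# Made-up records for two users
records = [
    {"text": "Alice's Q3 notes", "vec": [1.0, 0.0],
     "meta": {"owner": "alice", "dept": "sales"}},
    {"text": "Bob's Q3 notes", "vec": [0.99, 0.14],
     "meta": {"owner": "bob", "dept": "sales"}},
    {"text": "Alice's HR form", "vec": [0.0, 1.0],
     "meta": {"owner": "alice", "dept": "hr"}},
]
query_vec = [0.99, 0.14]   # closest to Bob's document

unfiltered = filtered_search(records, query_vec, k=1)
alice_only = filtered_search(records, query_vec, k=1, owner="alice")
```

Without a filter the nearest vector is Bob's document; restricting to owner="alice" returns Alice's closest document instead. Treating the filter as a hard constraint applied before ranking is what keeps one user's data out of another user's results.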
29.5

Before documents can be embedded and indexed, they must be split into CHUNKS — smaller passages. Chunking is deceptively important: get it wrong and even great embeddings and search produce poor results, because the units being retrieved are the wrong size or split mid-thought. Beginners often overlook chunking; experts know it is one of the biggest levers on RAG quality.

Why Chunk At All?

Two reasons. First, embeddings work best on focused passages — embedding an entire 50-page document into one vector blurs its many topics into mush, so retrieval can't distinguish them. Second, the model's context window is finite — you can only fit so much retrieved text in the prompt, so you want to retrieve focused, relevant passages, not whole documents. Chunking creates retrievable units that are focused enough to embed well and small enough to fit in context.

The Chunking Trade-off

Chunks too small | Chunks too large
Lose surrounding context | Dilute the relevant part
Answer split across chunks | Embedding blurs many topics
Retrieve fragments | Waste context-window space
Miss the full picture | Retrieve irrelevant text too
e.g. single sentences | e.g. whole documents

The art is finding a chunk size that is focused enough to embed and retrieve precisely, yet large enough to contain a complete idea. There is no universal answer — it depends on your documents and queries — but a few hundred tokens with some overlap is a common starting point.

Chunking Strategies

Strategy | How it splits
Fixed-size | Every N tokens/characters; simple but may split mid-sentence
With overlap | Fixed-size, but chunks overlap so context isn't lost at boundaries
Sentence/paragraph | Split on natural boundaries; keeps ideas intact
Recursive | Try paragraphs, then sentences, then words to hit a target size
Semantic | Split where the topic shifts (detected by embedding changes)
Structure-aware | Respect document structure (headings, sections, code blocks)

Python: Chunking with overlap
def chunk_with_overlap(text, size=400, overlap=50):
    """Split text into overlapping chunks, measured in whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError('overlap must be non-negative and smaller than size')
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break                     # this chunk already reached the end of the text
        # Step forward by (size - overlap) so chunks OVERLAP
        start += size - overlap
    return chunks

# Overlap keeps a short idea that straddles a chunk boundary WHOLE
# in at least one chunk -- so it can be retrieved intact. Better still:
# split on paragraph/sentence boundaries to avoid cutting mid-thought.
RAG Note: Overlap Prevents Boundary Loss
A key trick: make chunks OVERLAP slightly. Without overlap, an idea that straddles a chunk boundary gets split, half in one chunk and half in the next, so neither chunk contains the complete thought and retrieval may miss it. With overlap, any passage shorter than the overlap that crosses a boundary appears in full in at least one chunk. A modest overlap (10–20% of chunk size) is cheap insurance against boundary loss.
Even better than fixed-size-with-overlap is splitting on NATURAL boundaries (paragraphs, sections), so chunks contain complete ideas by construction. Structure-aware chunking that respects headings and sections often outperforms naive fixed-size splitting, especially for well-structured documents.
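Splitting on natural boundaries can be sketched as follows. This is a minimal illustration under stated assumptions: paragraphs are separated by blank lines, size is measured in whitespace-separated words, and a paragraph longer than the limit is kept whole as an oversized chunk rather than being split further (a recursive splitter would fall back to sentences at that point). The function name chunk_by_paragraph is hypothetical.

```python
def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words becomes its own oversized chunk
    rather than being cut mid-thought (a simplification).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the limit
        if current and current_len + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
chunks = chunk_by_paragraph(doc, max_words=5)
# chunks == ["one two three\n\nfour five", "six seven eight nine\n\nten"]
```

No paragraph is ever cut in two, so each chunk contains complete ideas by construction; the cost is that chunk sizes vary with the document's structure.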
29.6

Dense (semantic) retrieval and sparse (keyword) retrieval each have strengths and weaknesses. HYBRID SEARCH combines them to get the best of both — and in practice, hybrid search usually beats either alone. Understanding why requires seeing exactly where each method shines and fails.

Where Each Method Wins and Fails

Query type | Dense retrieval | Keyword search
Synonyms / paraphrases | ✓ strong | ✗ misses
Exact terms / codes / IDs | ✗ may blur | ✓ exact
Names, acronyms, rare words | ✗ weak | ✓ strong
Conceptual / fuzzy questions | ✓ strong | ✗ weak
Typos / morphology | varies | ✗ brittle

The pattern is clear: dense retrieval excels at MEANING (synonyms, concepts, fuzzy questions), while keyword search excels at EXACTNESS (specific terms, names, codes, rare words). A query like 'error code TX-409 in the billing module' needs BOTH — the exact code (keyword) and the conceptual context of billing errors (dense). Hybrid search runs both and combines the results.

Combining the Scores

A standard way to merge dense and keyword results is RECIPROCAL RANK FUSION (RRF): each method ranks the chunks, and a chunk's combined score is based on its RANK in each list (not its raw scores, which aren't comparable across methods). A chunk ranked highly by EITHER method scores well, and one ranked highly by both scores best. This is simple, robust, and avoids the problem of dense and keyword scores being on different scales.

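Text: Reciprocal Rank Fusion (RRF)
score(chunk) = Σ over lists of 1 / (k + rank_in_list)

k ≈ 60 (a smoothing constant, unrelated to the k in top-k)
rank_in_list = the chunk's 1-based position in that method's ranking
               (a chunk absent from a list contributes nothing for that list)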

# Chunks ranked highly by EITHER dense or keyword search score well.
# Uses RANKS, not raw scores -- so the two methods' scales don't matter.
Python: Hybrid search with reciprocal rank fusion
def reciprocal_rank_fusion(dense_ids, keyword_ids, k=60):
    """Fuse two ranked lists of chunk ids into one combined ranking."""
    scores = {}
    for ranked_list in [dense_ids, keyword_ids]:
        # Ranks are 1-based, matching the formula above
        for rank, doc_id in enumerate(ranked_list, start=1):
            # Earlier rank (smaller number) -> bigger contribution
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by combined score, highest first
    return sorted(scores, key=scores.get, reverse=True)

# dense_retrieve and bm25_retrieve are assumed to return ranked chunk ids
dense   = dense_retrieve(query, k=20)     # semantic results
keyword = bm25_retrieve(query, k=20)      # keyword (BM25) results
final   = reciprocal_rank_fusion(dense, keyword)[:10]

# Best of both: semantic understanding + exact-term matching.
RAG Note: Hybrid Is the Production Default
In production RAG, hybrid search is usually the right default. The cost of running both a dense and a keyword search is modest, and the robustness gain is large: you stop failing on the queries where one method alone is weak (exact codes for dense, paraphrases for keyword). Many vector databases support hybrid search natively, fusing dense and sparse results for you.
The lesson generalizes: when two methods have complementary strengths and combining them is cheap, combine them. Dense and keyword retrieval are not rivals — they are partners that cover each other's blind spots.
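The hybrid example above calls a bm25_retrieve helper that the chapter does not define. As a rough idea of what sits behind it, here is a minimal sketch of one common BM25 variant in plain Python. The class name ToyBM25, the simple regex tokenizer, the parameter defaults (k1=1.5, b=0.75), and the sample chunks are assumptions for illustration; production systems would normally use a search engine or an established BM25 library rather than this.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep letters, digits, and hyphens (so 'TX-409' stays whole)."""
    return re.findall(r"[a-z0-9-]+", text.lower())


class ToyBM25:
    """Minimal Okapi BM25 scorer over a small in-memory list of chunks."""

    def __init__(self, chunks, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(c)) for c in chunks]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: in how many chunks each term appears
        self.df = Counter(term for d in self.docs for term in d)

    def idf(self, term):
        """Rare terms get a high weight, common terms a low one."""
        n, total = self.df.get(term, 0), len(self.docs)
        return math.log((total - n + 0.5) / (n + 0.5) + 1)

    def score(self, query, i):
        doc, length = self.docs[i], self.lengths[i]
        result = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            # Term-frequency saturation with document-length normalization
            denom = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            result += self.idf(term) * tf * (self.k1 + 1) / denom
        return result

    def search(self, query, k=3):
        """Return the ids of the k highest-scoring chunks, best first."""
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


# Made-up sample chunks
chunks = [
    "Billing module overview and invoice settings.",
    "Error code TX-409 means the payment gateway timed out.",
    "Contact support for login problems.",
]
bm25 = ToyBM25(chunks)
hits = bm25.search("TX-409", k=1)
```

The query matches only the chunk that contains the literal code, and every other chunk scores exactly zero. That all-or-nothing behavior on exact terms is the strength keyword search brings to a hybrid system, and also why it misses paraphrases.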
29.7

Initial retrieval (dense, keyword, or hybrid) is fast but coarse — it casts a wide net to find candidate chunks. RERANKING adds a second, more precise pass: take the top candidates from retrieval and re-score them with a more powerful (but slower) model that judges relevance more accurately. This two-stage 'retrieve-then-rerank' approach significantly improves the quality of the final context.

Why a Second Pass Helps

Fast retrieval embeds the query and each chunk SEPARATELY, then compares vectors — the chunk's vector was computed without ever seeing the query. A RERANKER instead looks at the query and a candidate chunk TOGETHER, in one model pass, judging how well that specific chunk answers that specific query. This 'cross-encoder' approach is far more accurate, but too slow to run over millions of chunks — so it is applied only to the top candidates from fast retrieval.

Arch Stack: Two-stage retrieval: retrieve wide, then rerank precisely

Final top-k (e.g. 5): fed to the model as context
Reranker (cross-encoder): scores query+chunk together, accurately
Candidates (e.g. top 50): from fast retrieval
Fast retrieval (bi-encoder / ANN): over millions of chunks

Bi-encoder (retrieval) | Cross-encoder (reranking)
Embeds query & chunk separately | Processes query + chunk together
Vectors precomputed offline | Computed per query-chunk pair
Fast; scales to millions | Slow; only for top candidates
Coarse relevance | Precise relevance
Stage 1: cast a wide net | Stage 2: pick the best few

Python: Reranking retrieved candidates
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, k_retrieve=50, k_final=5):
    # Stage 1: fast retrieval casts a wide net.
    # retrieve() (Section 29.3) returns (chunk, score) pairs; keep only the text.
    candidates = [chunk for chunk, _ in retrieve(query, k=k_retrieve)]

    # Stage 2: rerank by scoring (query, chunk) pairs together
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)            # precise relevance scores

    # Keep the k_final best after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, s in ranked[:k_final]]

# Retrieve 50 cheaply, rerank to the best 5 precisely. The 5 that go
# in the prompt are typically more relevant than raw retrieval's top 5.
RAG Note: Reranking Is a High-ROI Addition
Adding a reranker is one of the most reliable ways to improve a RAG system's quality. Fast retrieval optimizes for RECALL (don't miss relevant chunks) by casting a wide net; the reranker optimizes for PRECISION (put the BEST chunks first) over that smaller set. Because the model only sees the top few reranked chunks, getting the ordering right directly improves answer quality. The extra latency is modest since reranking runs over dozens, not millions, of chunks.
The general pattern — a cheap, high-recall first stage followed by an expensive, high-precision second stage — recurs throughout retrieval and search systems. It lets you combine the scalability of fast methods with the accuracy of slow ones.
29.8

Once the best chunks are retrieved and reranked, they must be assembled into the prompt — 'context stuffing'. This step seems trivial but has real subtleties: how much to include, in what order, and how to instruct the model to use it. Done poorly, even perfect retrieval yields bad answers.

Building the Prompt

Python: Assembling retrieved context into a prompt
def build_rag_prompt(query, chunks):
    """Stuff retrieved chunks into a grounded-answer prompt."""
    context = '\n\n'.join(
        f'[Source {i+1}] {chunk}' for i, chunk in enumerate(chunks)
    )
    return f"""Answer the question using ONLY the sources below.
If the answer isn't in the sources, say you don't know.
Cite sources by number.

Sources:
{context}

Question: {query}
Answer:"""

# Key instructions: ground in the sources, admit ignorance, cite.
# These instructions are what make RAG reduce hallucination.

Instructions That Reduce Hallucination

The prompt instructions matter as much as the retrieved text. Three instructions are crucial: (1) 'answer using ONLY the sources' — grounds the model in the retrieved text; (2) 'if the answer isn't in the sources, say you don't know' — prevents the model from filling gaps with hallucination; (3) 'cite sources by number' — makes answers verifiable and traceable. These instructions are what turn retrieved text into trustworthy, grounded answers.

The Lost-in-the-Middle Problem

A surprising and important finding (Liu et al., 2023): models use information at the BEGINNING and END of the context more reliably than information in the MIDDLE. If you stuff ten chunks and the crucial one lands in the middle, the model may overlook it, a phenomenon called 'lost in the middle'. This is why reranking matters beyond just filtering: placing the most relevant chunks at the start (and end) of the context, not buried in the middle, improves the model's use of them.

Warning: More Context Is Not Always Better
A common beginner mistake is stuffing as many chunks as possible into the context, reasoning that more information helps. It often hurts: irrelevant chunks DISTRACT the model, the crucial chunk gets lost in the middle, and you waste context window and money. Quality and ORDERING of context beat quantity. Retrieve widely, rerank precisely, and include only the few BEST chunks, placed where the model will attend to them.
This connects to long-context limits (Chapter 33): even models with huge context windows do not use all of it equally well. Feeding a model a focused, well-ordered handful of relevant passages usually beats dumping a hundred mediocre ones. Less, but better, is the rule for RAG context.
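One simple way to act on the ordering advice is to interleave the reranked chunks so the strongest sit at the edges of the context and the weakest in the middle. The sketch below is one possible heuristic, not a prescribed method; the function name order_for_context is hypothetical, and whether it helps for a given model is something to measure.

```python
def order_for_context(ranked_chunks):
    """Place the strongest chunks at the edges of the context.

    ranked_chunks is best-first. Items at even positions fill the front,
    items at odd positions fill the back, so the weakest land in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


# r1 is the most relevant chunk, r5 the least
ordered = order_for_context(["r1", "r2", "r3", "r4", "r5"])
# ordered == ["r1", "r3", "r5", "r4", "r2"]
```

The best chunk opens the context, the second best closes it, and the least relevant chunk ends up in the middle where it is most likely to be overlooked anyway.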
29.9

RAG systems fail in characteristic ways, and — as stressed in Section 29.2 — most failures are in the RETRIEVAL half, not the generation. Knowing the failure modes and their fixes makes debugging systematic.

Failure mode | What happens | Fix
Wrong chunks retrieved | Irrelevant context | Better embeddings, hybrid, rerank
Relevant chunk missed | Answer not in context | Higher k, better chunking, hybrid
Answer split across chunks | No single chunk has it | Larger chunks, overlap, merging
Lost in the middle | Crucial chunk ignored | Rerank; best chunks first/last
Hallucinates anyway | Ignores or overrides context | Stronger grounding instructions
Outdated index | Stale answers | Re-index when documents change
No relevant docs exist | Should say 'I don't know' | 'Say I don't know' instruction

Debugging RAG: Check Retrieval First

When a RAG answer is wrong, the systematic first step is to INSPECT WHAT WAS RETRIEVED. Print the chunks that went into the prompt. Usually one of two things is true: either the right chunk was NOT retrieved (a retrieval problem — fix chunking, embeddings, k, or add hybrid/reranking), or the right chunk WAS retrieved but the model didn't use it (a generation problem — fix the prompt instructions or chunk ordering). Distinguishing these two cases tells you exactly where to focus.

RAG Note: Evaluate Retrieval and Generation Separately
Because RAG has two halves, evaluate them separately. For RETRIEVAL, measure whether the relevant chunks were found (recall@k: was the right chunk in the top k?). For GENERATION, measure whether the answer is faithful to the retrieved context (does it only claim what the sources support?) and whether it actually answers the question. A system can have great generation but poor retrieval, or vice versa — separate metrics pinpoint which to fix.
Tools and frameworks (RAGAS and others) automate these RAG-specific metrics: context relevance, answer faithfulness, and answer relevance. Measuring the two halves independently is the key discipline for improving a RAG system methodically rather than by guesswork.
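As a concrete starting point, recall@k and the retrieval-versus-generation triage described above can be sketched in plain Python. The labeled queries below are hypothetical, the function names are illustrative, and the answer_correct flag is assumed to come from a human or automated judgment of the final answer.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def diagnose(retrieved_ids, relevant_ids, k, answer_correct):
    """Triage a single query: which half of the system should be inspected?"""
    if answer_correct:
        return "ok"
    if recall_at_k(retrieved_ids, relevant_ids, k) == 0.0:
        return "retrieval"    # the right chunk never reached the prompt
    return "generation"       # the right chunk was there but was not used


# Hypothetical labeled queries: (retrieved ids in ranked order, relevant ids)
eval_set = [
    (["c7", "c2", "c9"], {"c2"}),         # relevant chunk found at rank 2
    (["c1", "c4", "c5"], {"c5", "c8"}),   # one of two relevant chunks found
    (["c3", "c6", "c0"], {"c4"}),         # relevant chunk missed entirely
]
mean_recall_at_3 = sum(recall_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
# (1.0 + 0.5 + 0.0) / 3 == 0.5
```

Running diagnose over every failed query gives a quick count of how many failures are retrieval problems and how many are generation problems, which tells you where to spend effort first.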
29.10

The basic retrieve-then-generate pipeline can be extended in many ways to handle harder cases. These advanced patterns address specific weaknesses of vanilla RAG and are increasingly common in production systems.

Pattern | What it adds
Query rewriting | Rephrase/expand the query before retrieval for better matches
Multi-query | Generate several query variations, retrieve for each, merge
HyDE | Generate a hypothetical answer, embed IT, and retrieve with that
Agentic RAG | The model decides when and what to retrieve, via tool calls
Self-RAG | The model critiques whether retrieval is needed and if results suffice
GraphRAG | Build a knowledge graph; retrieve over entities and relationships
Contextual retrieval | Prepend document context to each chunk before embedding

Query Rewriting and Expansion

Users phrase questions in ways that don't match how documents are written. Query rewriting uses the model to rephrase or expand the query before retrieval — turning a terse 'parental leave?' into 'What is the company parental leave policy, including duration and eligibility?' — which retrieves better. Multi-query retrieval generates several phrasings and merges their results, improving recall.

Agentic RAG: Retrieval as a Tool

The most important modern pattern connects directly to Chapter 28: treat retrieval as a TOOL the model can call. Instead of always retrieving once at the start, the model DECIDES when it needs to look something up, formulates the query itself, and can retrieve multiple times as it works through a problem (the ReAct loop). This 'agentic RAG' handles complex, multi-step questions that a single retrieval cannot — the model retrieves, reasons, retrieves again, and synthesizes.

Tool Trace: Agentic RAG: the model retrieves as needed

User: Compare our 2023 and 2024 revenue and explain the change.
Model: I need both years' figures; retrieve(2023 revenue report)
Retriever: Returns the 2023 financials chunk
Model: Now retrieve(2024 revenue report)
Retriever: Returns the 2024 financials chunk
Model: Has both; synthesizes the comparison and explanation
RAG Note: RAG and Tool Calling Are Converging
Notice how agentic RAG IS tool calling (Chapter 28) with a retrieval tool. The line between 'RAG' and 'an agent that can search' is blurring: modern systems give the model a retrieval tool and let it decide when and what to retrieve, often multiple times, interleaved with reasoning. This is more flexible than a fixed retrieve-once pipeline and handles complex questions better.
The unifying view: retrieval is one of the most important tools you can give a model. Basic RAG is 'always retrieve once at the start'; agentic RAG is 'retrieve whenever the model judges it useful'. As models get better at this judgment, agentic retrieval increasingly subsumes the fixed pipeline.
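The control flow of agentic retrieval can be sketched without a real model. In the sketch below the model is replaced by a scripted stand-in function, so nothing here reflects any particular LLM API: agentic_answer, scripted_policy, toy_retrieve, and the placeholder knowledge base are all assumptions made for illustration. What it shows is the loop itself: the policy chooses between retrieving and answering, and each retrieval result is fed back in before the next decision.

```python
def agentic_answer(question, policy, retrieve, max_steps=5):
    """Loop: ask the policy what to do; retrieve on request; stop on an answer."""
    notes = []   # (query, retrieved text) pairs gathered so far
    for _ in range(max_steps):
        action, payload = policy(question, notes)
        if action == "retrieve":
            notes.append((payload, retrieve(payload)))
        elif action == "answer":
            return payload, notes
        else:
            raise ValueError(f"unknown action: {action}")
    return "I don't know.", notes   # step budget exhausted


# --- Stand-ins for illustration only ---
KB = {
    "2023 revenue report": "2023 financials chunk (placeholder text)",
    "2024 revenue report": "2024 financials chunk (placeholder text)",
}


def toy_retrieve(query):
    """Exact-match lookup standing in for a real retriever."""
    return KB.get(query, "")


def scripted_policy(question, notes):
    """Scripted stand-in for a model: fetch 2023, then 2024, then answer."""
    seen = {query for query, _ in notes}
    for needed in ("2023 revenue report", "2024 revenue report"):
        if needed not in seen:
            return "retrieve", needed
    return "answer", "Comparison based on: " + "; ".join(text for _, text in notes)


answer, notes = agentic_answer(
    "Compare our 2023 and 2024 revenue and explain the change.",
    scripted_policy,
    toy_retrieve,
)
```

The max_steps budget is a deliberate design choice: a real model deciding for itself when to retrieve can loop indefinitely, so the caller caps the number of steps and falls back to an explicit "I don't know."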
29.11

RAG is one of three ways to get a model to use specific knowledge. The others are FINE-TUNING (bake the knowledge into the weights, Chapter 22) and LONG CONTEXT (just put everything in the prompt). Each has its place, and knowing when to use which is an important practical judgment.

Approach | How knowledge enters | Best when
RAG | Retrieved into the prompt per query | Large, changing knowledge base
Fine-tuning | Baked into the weights | Teaching style/format/skills
Long context | All stuffed into the prompt | Small, fixed knowledge that fits

Why RAG Often Wins for Knowledge

For injecting KNOWLEDGE (facts, documents, data), RAG has decisive advantages over fine-tuning: it updates instantly (just change the index — no retraining), scales to far more knowledge than fits in weights or context, provides citations, and keeps knowledge separate from the model so it's auditable and current. Fine-tuning is better for teaching SKILLS, STYLE, or FORMAT — how to behave — rather than facts to recall.

Compare: RAG vs Fine-Tuning: Knowledge vs Skill
Use RAG to give the model KNOWLEDGE it should look up: your documents, current facts, large or changing information. Knowledge stays external, updatable, and citable.
Use fine-tuning to give the model a SKILL or STYLE it should internalize: a domain's way of writing, a specific output format, a behaviour. These belong in the weights. The two are complementary — fine-tune for how to behave, RAG for what to know.

They Combine

RAG and fine-tuning are not mutually exclusive — the strongest systems often use both: fine-tune the model for the domain's style and the skill of using retrieved context well, AND use RAG to supply current, specific knowledge at query time. And as context windows grow (Chapter 33), the line between RAG and long context shifts — but even with huge contexts, retrieval remains valuable for selecting WHAT to put in the context from a knowledge base far too large to fit entirely.

29.12

Let us assemble the whole chapter into a complete, production-minded RAG system, integrating chunking, embedding, indexing, hybrid retrieval, reranking, context assembly, and grounded generation.

Pipeline Flow: A complete RAG system

1Ingest & chunkSplit documents on natural boundaries, with overlap
2Embed & indexEmbed chunks; store in a vector DB with metadata
3Hybrid retrieveDense + keyword search, fused with RRF
4RerankCross-encoder picks the best few from the candidates
5AssembleBest chunks first/last; grounding + citation instructions
6GenerateModel answers grounded in context, cites sources
7EvaluateMeasure retrieval recall and answer faithfulness separately
PythonA complete RAG system (bringing it together)
class RAGSystem:
    def __init__(self, documents):
        # OFFLINE: chunk, embed, index
        self.chunks = [c for d in documents for c in chunk_with_overlap(d)]
        self.vecs = embedder.encode(self.chunks, normalize_embeddings=True)
        self.index = build_faiss_index(self.vecs)
        self.bm25  = build_bm25(self.chunks)        # for hybrid search

    def answer(self, query):
        # 1. Hybrid retrieve (dense + keyword), fuse with RRF
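        # Normalize the query like the chunks so inner product equals cosine similarity
        q_vec   = embedder.encode(query, normalize_embeddings=True)
        dense   = faiss_search(self.index, q_vec, k=30)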
        keyword = self.bm25.search(query, k=30)
        candidates = reciprocal_rank_fusion(dense, keyword)[:30]

        # 2. Rerank to the best few
        top = rerank(query, candidates, k_final=5)

        # 3. Assemble grounded prompt and generate
        prompt = build_rag_prompt(query, top)    # cite + 'say I don't know'
        return model.generate(prompt)

# Each stage (chunk, embed, hybrid, rerank, ground) adds robustness.
# Frameworks (LlamaIndex, LangChain) provide this plumbing prebuilt.
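
The `reciprocal_rank_fusion` helper used above is not shown in this section, so here is a minimal sketch of one way it could look. It assumes each retriever returns a best-first list of chunk ids and uses the commonly cited constant k = 60. The signature is an assumption and may differ from the helper the rest of the chapter has in mind.

```python
from collections import defaultdict


def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse several best-first rankings of chunk ids into one ranking.

    Each chunk's fused score is the sum, over the rankings it appears in,
    of 1 / (k + rank), where rank starts at 1. Only ranks are used, so the
    retrievers' raw scores never need to be on the same scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical results from the dense and keyword retrievers.
dense = ["c3", "c1", "c7", "c2"]
keyword = ["c1", "c9", "c3", "c4"]
fused = reciprocal_rank_fusion(dense, keyword)
# fused == ["c1", "c3", "c9", "c7", "c2", "c4"]
```

Chunks that both retrievers rank highly (c1, c3) rise to the top, while chunks found by only one retriever are kept but placed lower.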
RAG Note: Start Simple, Add Stages as Needed
You do not need every stage on day one. Start with the simplest RAG that works: chunk, embed, retrieve top-k, stuff into a grounded prompt. Measure where it fails. Add hybrid search if exact terms are missed; add reranking if the right chunks are retrieved but ranked poorly; add query rewriting if user phrasing is the problem; go agentic if questions need multiple retrievals. Each stage targets a specific failure — add it when the data shows you need it.
This mirrors the engineering discipline from Part V: start simple, measure, and add complexity only where it earns its place. A well-tuned basic RAG often beats an elaborate one built without measurement.
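
"Measure where it fails" can start with something as small as recall@k on a hand-labelled set. The sketch below assumes an evaluation set of (question, relevant chunk ids) pairs and a `retrieve` callable that returns best-first chunk ids. The names and the canned results are hypothetical stand-ins for a real retriever.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(eval_set, retrieve, k):
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(q), rel, k) for q, rel in eval_set]
    return sum(scores) / len(scores)


# Canned rankings standing in for a real retriever (illustrative only).
fake_results = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
    "q3": ["m", "n", "o", "p"],
}


def retrieve(question):
    return fake_results[question]


eval_set = [("q1", {"a"}), ("q2", {"z", "q"}), ("q3", {"p"})]
recall_3 = mean_recall_at_k(eval_set, retrieve, k=3)  # 0.5
recall_4 = mean_recall_at_k(eval_set, retrieve, k=4)  # about 0.83
```

Reading the two numbers together is what guides the next stage. Recall that is high at a wide k but low at a narrow k suggests the right chunks are found but ranked poorly, which points to reranking. Recall that is low even at a wide k points to the retriever itself (hybrid search or query rewriting).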
29.13

RAG Quick-Reference

Concept | Key idea | Remember
Why RAG | Ground answers in external knowledge | Open-book exam for models
The pipeline | Retrieve, then generate | Retrieval quality dominates
Dense retrieval | Match by meaning via embeddings | Finds synonyms/paraphrases
Vector DB / FAISS | Fast approximate nearest-neighbor | Scales to millions of vectors
Chunking | Split documents into passages | Overlap; natural boundaries
Hybrid search | Dense + keyword, fused (RRF) | Best of meaning + exactness
Reranking | Precise second pass on candidates | Retrieve wide, rerank narrow
Context assembly | Stuff best chunks, instruct to ground | Beware lost-in-the-middle
RAG vs fine-tune | Knowledge vs skill | RAG updates instantly, cites

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–22 require code.

Exercise 1: Pen & Paper
List the problems RAG solves that a model alone cannot. Explain the open-book-exam analogy and why retrieval quality dominates.
Exercise 2: Pen & Paper
Describe the offline and online phases of the RAG pipeline. What happens in each, and why is retrieval the half to suspect first when answers are wrong?
Exercise 3: Pen & Paper
Explain dense retrieval. Why can it find relevant text that shares no words with the query, and what does cosine similarity measure?
Exercise 4: Pen & Paper
Contrast dense and sparse (keyword) retrieval. Give a query where each wins and one where you'd want both.
Exercise 5: Pen & Paper
Why is exact (brute-force) vector search too slow at scale? Explain approximate nearest neighbor and the recall/speed trade-off.
Exercise 6: Pen & Paper
Explain the chunking trade-off (too small vs too large). Why does overlap help, and why might natural-boundary chunking beat fixed-size?
Exercise 7: Pen & Paper
Explain hybrid search and reciprocal rank fusion. Why does RRF use ranks rather than raw scores?
Exercise 8: Pen & Paper
Explain reranking. Why is a cross-encoder more accurate than a bi-encoder, and why is it applied only to the top candidates?
Exercise 9: Pen & Paper
Describe the lost-in-the-middle problem and two ways to mitigate it. Why is 'more context' not always better?
Exercise 10: Pen & Paper
Compare RAG, fine-tuning, and long context for injecting knowledge. When is each best, and how do RAG and fine-tuning combine?
Exercise 11: Code
Implement dense retrieval from scratch: embed a small set of chunks, embed a query, and return the top-k by cosine similarity.
Exercise 12: Code
Build a FAISS index over chunk embeddings and retrieve the top-k for a query. Compare a Flat (exact) index to an HNSW (approximate) index on speed and recall.
Exercise 13: Code
Implement three chunking strategies (fixed-size, fixed-with-overlap, sentence-based) and compare retrieval quality on a small document set.
Exercise 14: Code
Implement BM25 keyword retrieval and combine it with dense retrieval using reciprocal rank fusion. Show a query where hybrid beats either alone.
Exercise 15: Code
Add a cross-encoder reranker: retrieve 50 candidates, rerank to the top 5, and compare the final 5 to retrieval's raw top 5 on relevance.
Exercise 16: Code
Build a RAG prompt with grounding and citation instructions. Demonstrate that the 'say I don't know' instruction prevents hallucination when no relevant chunk exists.
Exercise 17: Code
Demonstrate the lost-in-the-middle effect: place the answer chunk at the start, middle, and end of a long context and measure whether the model uses it.
Exercise 18: Code Lab
Build a complete basic RAG system (chunk, embed, index, retrieve, ground, generate) over a set of documents. Answer questions and inspect the retrieved chunks.
Exercise 19: Code Lab
Implement RAG evaluation: measure retrieval recall@k (was the right chunk retrieved?) and answer faithfulness (does the answer only claim what the sources support?) separately.
Exercise 20: Code
Implement query rewriting: use the model to expand a terse query before retrieval, and show it improves recall on under-specified questions.
Exercise 21: Code Lab
Build agentic RAG: give the model a retrieval tool (Chapter 28) and let it decide when and what to retrieve. Test it on a multi-hop question needing two retrievals.
Exercise 22: Code (Challenge)
Build a full production-style RAG system with overlapping natural-boundary chunking, hybrid (dense + BM25) retrieval fused with RRF, cross-encoder reranking, lost-in-the-middle-aware context ordering, and grounded generation with citations. Then build an evaluation set, measure retrieval recall and answer faithfulness, deliberately degrade each stage (bad chunking, no reranking, no hybrid) and quantify how much each stage contributed to the final quality.

Further reading: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020), the original RAG paper. “Dense Passage Retrieval” (Karpukhin et al., 2020). “Billion-scale similarity search with GPUs” and the FAISS library (Johnson et al., 2017). “HNSW” (Malkov & Yashunin, 2018) for graph-based ANN. “Lost in the Middle” (Liu et al., 2023). “Precise Zero-Shot Dense Retrieval (HyDE)” (Gao et al., 2022). “Self-RAG” (Asai et al., 2023). “Contextual Retrieval” (Anthropic, 2024). The RAGAS framework for RAG evaluation.


Next → Chapter 30: Multi-modal LLMs

So far our models work entirely in text. But the world is not only text — it is images, audio, and video. Chapter 30 extends LLMs to MULTIPLE MODALITIES: models that can SEE images and HEAR audio, not just read text. We will see how images are turned into tokens the model can process (vision encoders and projection), how text and visual information are fused, how these models are trained, and what they can do — from describing images to answering questions about charts and documents. The text-only assistant becomes one that perceives the world more like we do.
