Retrieval-Augmented Generation
In Chapter 28 the model gained the ability to call tools. One of the most valuable things to fetch is KNOWLEDGE: facts, documents, and data the model was never trained on or cannot reliably recall. Retrieval-Augmented Generation (RAG) is the technique for grounding a model's answers in an external knowledge source: retrieve the relevant text, put it in the prompt, and let the model answer based on it. RAG is one of the most widely used LLM techniques in production, and this chapter builds it from the ground up.
The Problems RAG Solves
| Problem | Without RAG | With RAG |
|---|---|---|
| Stale knowledge | Frozen at training cutoff | Retrieve current documents |
| No private data | Never saw your files | Retrieve from your knowledge base |
| Hallucination | Invents plausible facts | Grounds answers in real text |
| No sources | Can't cite anything | Cites the retrieved documents |
| Limited memory | Can't hold all knowledge | Stores knowledge externally |
The Core Idea: Look It Up, Then Answer
RAG mirrors how a careful person answers a hard question: rather than answering from memory alone, they LOOK IT UP in a reliable source, then answer based on what they found. RAG gives a model the same workflow. Given a question, the system first RETRIEVES the most relevant pieces of text from a knowledge base, inserts them into the prompt, and asks the model to answer USING that retrieved context. The model's broad language ability is combined with specific, up-to-date, trustworthy information.
RAG has two phases. An OFFLINE phase prepares the knowledge base (done once, ahead of time), and an ONLINE phase answers each query (done per question). Understanding the two phases and how they fit together is the foundation for everything else.
Offline: Building the Index
Pipeline Flow: Offline phase: prepare the knowledge base (done once)
| 1 | Collect | Gather your documents (files, web pages, database records) |
| 2 | Chunk | Split documents into smaller passages (Section 29.5) |
| 3 | Embed | Convert each chunk into a vector capturing its meaning (Section 29.3) |
| 4 | Index | Store the vectors in a vector database for fast search (Section 29.4) |
Online: Answering a Query
When a question arrives, the system retrieves relevant chunks and generates a grounded answer. This is the retrieve-then-generate flow — the heart of RAG:
Tool Trace: Online phase: the retrieve-then-generate flow
| User | What is our company's parental leave policy? | → |
| Embed | Convert the question into a query vector | • |
| Retriever | Find the most similar chunks in the vector DB | → |
| Vector DB | Returns top-k relevant passages from the HR handbook | ← |
| App | Stuffs the retrieved passages into the prompt as context | • |
| Model | Generates an answer grounded in the retrieved policy text | • |
| User | 'Employees get 16 weeks of paid leave... [grounded in the handbook]' | ← |
# OFFLINE (once): build the index
for each document: chunk it, embed each chunk, store vectors in the index
# ONLINE (per query): retrieve then generate
1. embed the user's query into a vector
2. retrieve the top-k most similar chunks from the index
3. (optional) rerank the chunks for relevance
4. build a prompt: question + retrieved chunks as context
5. the model generates an answer grounded in that context
The heart of modern retrieval is the EMBEDDING, a vector that captures the MEANING of a piece of text, building directly on the embeddings of Chapter 8. Dense retrieval uses embeddings to find text by SEMANTIC similarity: passages that MEAN the same thing as the query, even if they share no words. This is what lets RAG find 'parental leave' content when you ask about 'maternity time off'.
From Text to Vectors
An embedding model converts any piece of text into a fixed-length vector — a list of numbers — positioned so that texts with SIMILAR MEANING have NEARBY vectors. 'How do I reset my password?' and 'I forgot my login credentials' land close together in the vector space, despite sharing almost no words, because they mean nearly the same thing. This semantic matching is the superpower of dense retrieval over keyword search.
Shape Trace: Embedding text for retrieval
| Operation | Shape | Note |
|---|---|---|
| query text | "reset password" | raw string |
| embedding model | → | encode |
| query vector | (768,) | captures meaning |
| compare to chunk vectors | (N, 768) | cosine similarity |
| top-k nearest | (k, 768) | most similar chunks |
Measuring Similarity
To find the chunks most similar to the query, we compare their vectors. The standard measure is COSINE SIMILARITY — the cosine of the angle between two vectors (Chapter 1) — which is high when vectors point in the same direction (similar meaning) and low when they don't. Retrieval finds the k chunks whose vectors are most similar to the query vector.
similarity(q, c) = (q · c) / (||q|| · ||c||)
q = query vector, c = chunk vector
= 1.0 when meanings are identical (vectors aligned)
= 0.0 when unrelated (vectors orthogonal)
# Retrieve the k chunks with the HIGHEST similarity to the query.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')  # produces 384-dimensional vectors

# OFFLINE: embed all chunks once
chunks = ['Reset your password in Settings.', 'Office hours are 9-5.']  # ... plus the rest of your chunks
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # (N, 384)

def retrieve(query, k=3):
    """Return the k chunks most similar to the query, with their scores."""
    q = embedder.encode(query, normalize_embeddings=True)  # (384,)
    # Cosine similarity = dot product (vectors are normalized)
    scores = chunk_vecs @ q              # (N,)
    top = np.argsort(-scores)[:k]        # indices of the k highest scores
    return [(chunks[i], scores[i]) for i in top]

retrieve('I forgot my login')  # -> finds the password-reset chunk,
# despite sharing no words. Semantic match!
Computing cosine similarity against every chunk works for a few thousand chunks, but real knowledge bases can hold millions or billions. Comparing the query to every single vector (a 'brute-force' search) becomes too slow. VECTOR DATABASES solve this with specialized indexes that find the nearest vectors quickly; they are the infrastructure that makes RAG scale.
The Problem: Exact Search Is Too Slow
Brute-force search compares the query against all N chunk vectors — O(N) per query. For millions of vectors and many queries per second, this is prohibitively slow. We need a way to find the nearest vectors WITHOUT comparing against all of them. The trick is to accept APPROXIMATE answers: find vectors that are ALMOST certainly among the nearest, far faster than guaranteeing the exact nearest.
How ANN Indexes Work
ANN indexes organize vectors so that search can skip most of them. Two popular families: (1) HNSW (Hierarchical Navigable Small World) builds a navigable graph where search hops toward the query through a few well-connected nodes; (2) IVF (Inverted File) clusters vectors and searches only the nearest clusters. Both turn an O(N) scan into something far faster, at the cost of occasionally missing a true nearest neighbor.
| Index type | How it works | Trade-off |
|---|---|---|
| Flat (brute force) | Compare against every vector | Exact, but O(N) slow |
| IVF | Cluster, search nearest clusters only | Fast, may miss some |
| HNSW | Navigable graph, hop toward query | Fast, high recall, more memory |
| PQ (quantization) | Compress vectors to save memory | Smaller, slight accuracy loss |
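To make the IVF row concrete, here is a toy sketch in pure Python. It is not how FAISS implements IVF; the helper names (`build_ivf`, `ivf_search`, `nprobe`), the few rounds of k-means, and the random test data are all assumptions chosen for illustration. It shows the core trade: scan only the clusters nearest the query, and accept that a true neighbor in an unscanned cluster will be missed.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def assign(vectors, centroids):
    """Put each vector id into the bucket of its most similar centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(v, centroids[c]))
        buckets[best].append(i)
    return buckets

def build_ivf(vectors, n_clusters, n_iters=5, seed=0):
    """Toy IVF build: a few k-means rounds, then one inverted list per cluster."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    dim = len(vectors[0])
    for _ in range(n_iters):
        buckets = assign(vectors, centroids)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                mean = [sum(vectors[i][d] for i in members) / len(members)
                        for d in range(dim)]
                centroids[c] = normalize(mean)
    # Final assignment so the inverted lists match the final centroids
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, buckets, k=3, nprobe=2):
    """Scan only the nprobe clusters whose centroids best match the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: -dot(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in buckets[c]]
    candidates.sort(key=lambda i: -dot(query, vectors[i]))
    return candidates[:k], len(candidates)

# Random unit vectors stand in for real chunk embeddings (illustrative only)
rng = random.Random(1)
vectors = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(200)]
query = normalize([rng.gauss(0, 1) for _ in range(8)])
centroids, buckets = build_ivf(vectors, n_clusters=10)
ids, scanned = ivf_search(query, vectors, centroids, buckets, k=3, nprobe=2)
print(f'scanned {scanned} of {len(vectors)} vectors, top ids: {ids}')
```

Raising `nprobe` scans more clusters and recovers more true neighbors at the cost of speed; setting it to the number of clusters degenerates to an exact brute-force scan.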
FAISS: A Vector Search Library
FAISS (Facebook AI Similarity Search) is a widely used library implementing these indexes. It is a LIBRARY (you embed it in your application), as opposed to vector DATABASES and services (Pinecone, Weaviate, Qdrant, Milvus, or the pgvector extension for PostgreSQL) that add storage, metadata filtering, and operations. For learning and many applications, FAISS is the workhorse; for production with persistence and scaling needs, a vector database often fits better.
import faiss
import numpy as np

# OFFLINE: build the index from chunk vectors, shape (N, dim), dtype float32
chunk_vecs = np.asarray(chunk_vecs, dtype=np.float32)
dim = chunk_vecs.shape[1]             # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)        # inner product = cosine (vectors are normalized)
index.add(chunk_vecs)                 # add all chunk vectors

# For millions of vectors, use an ANN index instead of Flat:
# index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph index (L2 metric by default,
#                                       # which ranks normalized vectors like cosine)

# ONLINE: search for the k nearest chunks to a query vector
query_vec = embedder.encode('I forgot my login', normalize_embeddings=True)
scores, ids = index.search(query_vec[None], k=5)   # FAISS expects shape (n_queries, dim)
results = [chunks[i] for i in ids[0] if i != -1]   # -1 marks "fewer than k results"
# Flat is exact but O(N); HNSW/IVF are approximate but scale to millions.
# Vector DBs (Pinecone, Qdrant, pgvector) add persistence + filtering.
Before documents can be embedded and indexed, they must be split into CHUNKS, or smaller passages. Chunking is deceptively important: get it wrong and even great embeddings and search produce poor results, because the units being retrieved are the wrong size or split mid-thought. Beginners often overlook chunking; experts know it is one of the biggest levers on RAG quality.
Why Chunk At All?
Two reasons. First, embeddings work best on focused passages — embedding an entire 50-page document into one vector blurs its many topics into mush, so retrieval can't distinguish them. Second, the model's context window is finite — you can only fit so much retrieved text in the prompt, so you want to retrieve focused, relevant passages, not whole documents. Chunking creates retrievable units that are focused enough to embed well and small enough to fit in context.
The Chunking Trade-off
| Chunks too small | Chunks too large |
|---|---|
| Lose surrounding context | Dilute the relevant part |
| Answer split across chunks | Embedding blurs many topics |
| Retrieve fragments | Waste context-window space |
| Miss the full picture | Retrieve irrelevant text too |
| e.g. single sentences | e.g. whole documents |
The art is finding a chunk size that is focused enough to embed and retrieve precisely, yet large enough to contain a complete idea. There is no universal answer — it depends on your documents and queries — but a few hundred tokens with some overlap is a common starting point.
Chunking Strategies
| Strategy | How it splits |
|---|---|
| Fixed-size | Every N tokens/characters — simple but may split mid-sentence |
| With overlap | Fixed-size, but chunks overlap so context isn't lost at boundaries |
| Sentence/paragraph | Split on natural boundaries — keeps ideas intact |
| Recursive | Try paragraphs, then sentences, then words to hit a target size |
| Semantic | Split where the topic shifts (detected by embedding changes) |
| Structure-aware | Respect document structure (headings, sections, code blocks) |
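The 'Recursive' row in the table above can be sketched as follows. This is one possible reading of that strategy, not a reference implementation: the function name `recursive_chunk`, the `max_words` budget (counting whitespace-separated words rather than model tokens), and the simple regular expressions for paragraph and sentence boundaries are all assumptions.

```python
import re

def recursive_chunk(text, max_words=120):
    """Split on paragraphs, then sentences, then words, until pieces fit max_words."""
    def n_words(s):
        return len(s.split())

    def split(piece, level):
        if n_words(piece) <= max_words:
            return [piece.strip()] if piece.strip() else []
        if level == 0:                       # try paragraph boundaries first
            parts = re.split(r'\n\s*\n', piece)
        elif level == 1:                     # then sentence boundaries
            parts = re.split(r'(?<=[.!?])\s+', piece)
        else:                                # last resort: fixed word windows
            words = piece.split()
            return [' '.join(words[i:i + max_words])
                    for i in range(0, len(words), max_words)]
        out = []
        for part in parts:
            out.extend(split(part, level + 1))
        return out

    # Greedily merge adjacent small pieces back up toward the target size
    chunks, current = [], ''
    for piece in split(text, 0):
        candidate = (current + ' ' + piece).strip()
        if n_words(candidate) <= max_words:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = 'RAG retrieves text. Then it answers.\n\n' + ' '.join(['filler'] * 30)
for chunk in recursive_chunk(sample, max_words=12):
    print(len(chunk.split()), 'words:', chunk[:40])
```

The merge step is a design choice: without it, a document made of many one-line paragraphs would produce many tiny chunks. Note that this sketch lets a merged chunk cross a paragraph boundary, which a structure-aware splitter would avoid.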
def chunk_with_overlap(text, size=400, overlap=50):
    """Split text into overlapping chunks, measured in whitespace-separated words."""
    if not 0 <= overlap < size:
        raise ValueError('overlap must be non-negative and smaller than size')
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + size
        chunks.append(' '.join(words[start:end]))
        if end >= len(words):
            break  # this chunk reached the end; don't emit a duplicate tail
        # Step forward by (size - overlap) so chunks OVERLAP
        start += size - overlap
    return chunks

# Overlap lets an idea that straddles a chunk boundary (and is no longer than
# the overlap) appear WHOLE in at least one chunk, so it can be retrieved
# intact. Better still: split on paragraph/sentence boundaries to avoid
# cutting mid-thought.
Dense (semantic) retrieval and sparse (keyword) retrieval each have strengths and weaknesses. HYBRID SEARCH combines them to get the best of both, and in practice hybrid search usually beats either alone. Understanding why requires seeing exactly where each method shines and fails.
Where Each Method Wins and Fails
| Query type | Dense wins | Keyword wins |
|---|---|---|
| Synonyms / paraphrases | ✓ strong | ✗ misses |
| Exact terms / codes / IDs | ✗ may blur | ✓ exact |
| Names, acronyms, rare words | ✗ weak | ✓ strong |
| Conceptual / fuzzy questions | ✓ strong | ✗ weak |
| Typos / morphology | varies | ✗ brittle |
The pattern is clear: dense retrieval excels at MEANING (synonyms, concepts, fuzzy questions), while keyword search excels at EXACTNESS (specific terms, names, codes, rare words). A query like 'error code TX-409 in the billing module' needs BOTH — the exact code (keyword) and the conceptual context of billing errors (dense). Hybrid search runs both and combines the results.
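The keyword side of that comparison is usually BM25. As a rough illustration of why it handles exact codes well, here is a minimal pure-Python sketch of BM25 scoring. The class name, the tokenizer that keeps hyphenated codes such as 'TX-409' as single tokens, the parameter defaults `k1=1.5` and `b=0.75`, and the three sample documents are assumptions for illustration; a production system would use a tested library or a search engine.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Keep hyphenated codes such as 'tx-409' as one token
    return re.findall(r'[a-z0-9]+(?:-[a-z0-9]+)*', text.lower())

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in self.tokens]
        self.avg_len = sum(len(t) for t in self.tokens) / len(docs)
        n = len(docs)
        df = Counter(term for t in self.tokens for term in set(t))
        # Rare terms get a high weight, common terms a low one
        self.idf = {term: math.log((n - f + 0.5) / (f + 0.5) + 1)
                    for term, f in df.items()}

    def score(self, query, i):
        tf, length = self.tf[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # only exact term matches contribute
            f = tf[term]
            denom = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / denom
        return total

    def search(self, query, k=3):
        ranked = sorted(range(len(self.tokens)),
                        key=lambda i: -self.score(query, i))
        return ranked[:k]

docs = [
    'Error code TX-409 means the billing module rejected the card.',
    'Error code TX-410 indicates a network timeout.',
    'Billing questions can be sent to the finance team.',
]
bm25 = BM25(docs)
print(bm25.search('error code TX-409 in the billing module'))
```

Because 'tx-409' appears in only one document, its inverse document frequency is high and it dominates the score, which is the exactness that dense embeddings tend to blur.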
Combining the Scores
The standard way to merge dense and keyword results is RECIPROCAL RANK FUSION (RRF): each method ranks the chunks, and a chunk's combined score is based on its RANK in each list, not on its raw scores, which are not comparable across methods. Chunks that rank highly in EITHER list rise toward the top, and chunks that rank highly in both rise furthest. This is simple, robust, and avoids the problem of dense and keyword scores being on different scales.
score(chunk) = Σ_{list ∈ lists} 1 / (k + rank_in_list)
k ≈ 60 (a smoothing constant)
rank_in_list = the chunk's position in each method's ranking
# Chunks ranked highly by EITHER dense or keyword search score well.
# Uses RANKS, not raw scores -- so the two methods' scales don't matter.

def reciprocal_rank_fusion(dense_ids, keyword_ids, k=60):
    """Fuse two ranked lists into one combined ranking."""
    scores = {}
    for ranked_list in [dense_ids, keyword_ids]:
        # Ranks are 1-based, as in the usual RRF formulation: the top item has rank 1
        for rank, doc_id in enumerate(ranked_list, start=1):
            # Earlier rank (smaller number) -> bigger contribution
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Sort by combined score, highest first
    return sorted(scores, key=scores.get, reverse=True)

dense = dense_retrieve(query, k=20)     # semantic results (ranked chunk ids)
keyword = bm25_retrieve(query, k=20)    # keyword (BM25) results (ranked chunk ids)
final = reciprocal_rank_fusion(dense, keyword)[:10]
# Best of both: semantic understanding + exact-term matching.
Initial retrieval (dense, keyword, or hybrid) is fast but coarse; it casts a wide net to find candidate chunks. RERANKING adds a second, more precise pass: take the top candidates from retrieval and re-score them with a more powerful (but slower) model that judges relevance more accurately. This two-stage 'retrieve-then-rerank' approach typically improves the quality of the final context.
Why a Second Pass Helps
Fast retrieval embeds the query and each chunk SEPARATELY, then compares vectors — the chunk's vector was computed without ever seeing the query. A RERANKER instead looks at the query and a candidate chunk TOGETHER, in one model pass, judging how well that specific chunk answers that specific query. This 'cross-encoder' approach is far more accurate, but too slow to run over millions of chunks — so it is applied only to the top candidates from fast retrieval.
Arch Stack: Two-stage retrieval: retrieve wide, then rerank precisely
| Final top-k (e.g. 5) | fed to the model as context |
| Reranker (cross-encoder) | scores query+chunk together, accurately |
| Candidates (e.g. top 50) | from fast retrieval |
| Fast retrieval (bi-encoder / ANN) | over millions of chunks |
| Bi-encoder (retrieval) | Cross-encoder (reranking) |
|---|---|
| Embeds query & chunk separately | Processes query + chunk together |
| Vectors precomputed offline | Computed per query-chunk pair |
| Fast — scales to millions | Slow — only for top candidates |
| Coarse relevance | Precise relevance |
| Stage 1: cast a wide net | Stage 2: pick the best few |
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def retrieve_and_rerank(query, k_retrieve=50, k_final=5):
    # Stage 1: fast retrieval casts a wide net.
    # retrieve() returns (chunk, score) pairs; keep only the chunk text.
    candidates = [chunk for chunk, _ in retrieve(query, k=k_retrieve)]
    # Stage 2: rerank by scoring (query, chunk) pairs together
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)    # precise relevance scores
    # Keep the k_final best after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, s in ranked[:k_final]]

# Retrieve 50 cheaply, rerank to the best 5 precisely. The 5 that actually
# go in the prompt are usually more relevant than raw retrieval's top 5.
Once the best chunks are retrieved and reranked, they must be assembled into the prompt, a step often called 'context stuffing'. This step seems trivial but has real subtleties: how much to include, in what order, and how to instruct the model to use it. Done poorly, even perfect retrieval yields bad answers.
Building the Prompt
def build_rag_prompt(query, chunks):
    """Stuff retrieved chunks into a grounded-answer prompt."""
    context = '\n\n'.join(
        f'[Source {i+1}] {chunk}' for i, chunk in enumerate(chunks)
    )
    return f"""Answer the question using ONLY the sources below.
If the answer isn't in the sources, say you don't know.
Cite sources by number.
Sources:
{context}
Question: {query}
Answer:"""

# Key instructions: ground in the sources, admit ignorance, cite.
# These instructions are what make RAG reduce hallucination.
Instructions That Reduce Hallucination
The prompt instructions matter as much as the retrieved text. Three instructions are crucial: (1) 'answer using ONLY the sources' — grounds the model in the retrieved text; (2) 'if the answer isn't in the sources, say you don't know' — prevents the model from filling gaps with hallucination; (3) 'cite sources by number' — makes answers verifiable and traceable. These instructions are what turn retrieved text into trustworthy, grounded answers.
The Lost-in-the-Middle Problem
A surprising and important finding (Liu et al., 2023): models pay MORE attention to information at the BEGINNING and END of the context than in the MIDDLE. If you stuff ten chunks and the crucial one lands in the middle, the model may overlook it — a phenomenon called 'lost in the middle'. This is why reranking matters beyond just filtering: placing the most relevant chunks at the start (and end) of the context, not buried in the middle, improves the model's use of them.
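One simple way to act on this is to reorder the reranked chunks before building the prompt, so the strongest ones sit at the two edges of the context and the weakest sit in the middle. The sketch below is one possible arrangement, not a method taken from the cited paper; the function name `order_for_context` is hypothetical, and it assumes its input is already sorted best-first.

```python
def order_for_context(ranked_chunks):
    """Place best-first chunks at the edges of the context, weakest in the middle.

    ranked_chunks must be sorted from most to least relevant.
    """
    front = ranked_chunks[0::2]   # ranks 1, 3, 5, ... fill the start
    back = ranked_chunks[1::2]    # ranks 2, 4, 6, ... fill the end, reversed
    return front + back[::-1]

ranked = ['rank1', 'rank2', 'rank3', 'rank4', 'rank5']
print(order_for_context(ranked))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The best chunk opens the context and the second best closes it. Whether this helps a particular model is an empirical question, so it is worth measuring against plain best-first ordering on your own queries.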
RAG systems fail in characteristic ways, and — as stressed in Section 29.2 — most failures are in the RETRIEVAL half, not the generation. Knowing the failure modes and their fixes makes debugging systematic.
| Failure mode | What happens | Fix |
|---|---|---|
| Wrong chunks retrieved | Irrelevant context | Better embeddings, hybrid, rerank |
| Relevant chunk missed | Answer not in context | Higher k, better chunking, hybrid |
| Answer split across chunks | No single chunk has it | Larger chunks, overlap, merging |
| Lost in the middle | Crucial chunk ignored | Rerank; best chunks first/last |
| Hallucinates anyway | Ignores or overrides context | Stronger grounding instructions |
| Outdated index | Stale answers | Re-index when documents change |
| No relevant docs exist | Should say 'I don't know' | 'Say I don't know' instruction |
Debugging RAG: Check Retrieval First
When a RAG answer is wrong, the systematic first step is to INSPECT WHAT WAS RETRIEVED. Print the chunks that went into the prompt. Usually one of two things is true: either the right chunk was NOT retrieved (a retrieval problem — fix chunking, embeddings, k, or add hybrid/reranking), or the right chunk WAS retrieved but the model didn't use it (a generation problem — fix the prompt instructions or chunk ordering). Distinguishing these two cases tells you exactly where to focus.
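That triage can be written down as a tiny helper for a labeled test set. This is a rough sketch under a strong assumption: that each test question comes with a short `expected_fact` string and that a case-insensitive substring check is a good enough proxy for 'the right chunk was retrieved' and 'the answer used it'. Real evaluations usually need a human or a model-based judge; the function name `diagnose` is hypothetical.

```python
def diagnose(retrieved_chunks, answer, expected_fact):
    """Classify a RAG result as a retrieval failure, a generation failure, or ok."""
    fact = expected_fact.lower()
    was_retrieved = any(fact in chunk.lower() for chunk in retrieved_chunks)
    if not was_retrieved:
        # The fact never reached the prompt: fix chunking, embeddings, k, hybrid, rerank
        return 'retrieval'
    if fact not in answer.lower():
        # The fact was in the prompt but not in the answer: fix instructions or ordering
        return 'generation'
    return 'ok'

chunks_in_prompt = ['Employees get 16 weeks of paid leave.']
print(diagnose(chunks_in_prompt, 'Employees get 12 weeks.', '16 weeks'))
```

Running this over a few dozen labeled questions gives a quick count of how many failures belong to each half of the pipeline, which tells you where to spend effort first.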
The basic retrieve-then-generate pipeline can be extended in many ways to handle harder cases. These advanced patterns address specific weaknesses of vanilla RAG and are increasingly common in production systems.
| Pattern | What it adds |
|---|---|
| Query rewriting | Rephrase/expand the query before retrieval for better matches |
| Multi-query | Generate several query variations, retrieve for each, merge |
| HyDE | Generate a hypothetical answer, embed IT, and retrieve with that |
| Agentic RAG | The model decides when and what to retrieve, via tool calls |
| Self-RAG | The model critiques whether retrieval is needed and if results suffice |
| GraphRAG | Build a knowledge graph; retrieve over entities and relationships |
| Contextual retrieval | Prepend document context to each chunk before embedding |
Query Rewriting and Expansion
Users phrase questions in ways that don't match how documents are written. Query rewriting uses the model to rephrase or expand the query before retrieval — turning a terse 'parental leave?' into 'What is the company parental leave policy, including duration and eligibility?' — which retrieves better. Multi-query retrieval generates several phrasings and merges their results, improving recall.
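A minimal sketch of multi-query retrieval follows. Everything here is a stand-in: `rewrite_query` returns canned phrasings where a real system would call the model, `toy_retrieve` ranks chunks by shared words where a real system would use dense or hybrid search, and the three sample chunks are invented for illustration. The part that carries over is the shape of the loop: retrieve once per phrasing, then fuse the ranked lists (here with reciprocal rank fusion).

```python
def rewrite_query(query):
    """Hypothetical stand-in for an LLM call that returns several phrasings."""
    canned = {
        'parental leave?': [
            'What is the company parental leave policy?',
            'How many weeks of maternity or paternity leave do employees get?',
        ]
    }
    return [query] + canned.get(query, [])

def words(text):
    return set(text.lower().replace('?', '').replace('.', '').split())

def toy_retrieve(query, chunks, k=2):
    """Toy retriever: rank chunk ids by the number of words shared with the query."""
    scored = [(len(words(query) & words(c)), i) for i, c in enumerate(chunks)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

def multi_query_retrieve(query, chunks, k=3, rrf_k=60):
    """Retrieve for every phrasing, then merge the rankings with RRF."""
    fused = {}
    for variant in rewrite_query(query):
        for rank, cid in enumerate(toy_retrieve(variant, chunks), start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(fused, key=lambda cid: (-fused[cid], cid))[:k]

chunks = [
    'Employees get 16 weeks of paid parental leave.',
    'Maternity and paternity time off is requested in the HR portal.',
    'Office hours are 9 to 5.',
]
print('single query:', toy_retrieve('parental leave?', chunks))
print('multi-query: ', multi_query_retrieve('parental leave?', chunks))
```

In this toy data the terse query alone never reaches the HR-portal chunk, because it shares no words with it; one of the expanded phrasings does, which is the recall gain the paragraph above describes.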
Agentic RAG: Retrieval as a Tool
The most important modern pattern connects directly to Chapter 28: treat retrieval as a TOOL the model can call. Instead of always retrieving once at the start, the model DECIDES when it needs to look something up, formulates the query itself, and can retrieve multiple times as it works through a problem (the ReAct loop). This 'agentic RAG' handles complex, multi-step questions that a single retrieval cannot — the model retrieves, reasons, retrieves again, and synthesizes.
Tool Trace: Agentic RAG: the model retrieves as needed
| User | Compare our 2023 and 2024 revenue and explain the change. | → |
| Model | I need both years' figures — retrieve(2023 revenue report) | → |
| Retriever | Returns the 2023 financials chunk | ← |
| Model | Now retrieve(2024 revenue report) | → |
| Retriever | Returns the 2024 financials chunk | ← |
| Model | Has both — synthesizes the comparison and explanation | ← |
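The trace above can be mimicked with a short control loop. In this sketch the 'model' is a scripted function, not an LLM: `scripted_model`, the action dictionaries with `tool`, `query`, and `answer` keys, and the bracketed placeholder chunks are all assumptions made so the example runs on its own. A real system would replace `scripted_model` with a tool-calling model as in Chapter 28.

```python
def scripted_model(question, observations):
    """Hypothetical stand-in for a tool-calling LLM.

    Asks for each year's report in turn, then answers once it has both.
    """
    needed = ['2023 revenue report', '2024 revenue report']
    if len(observations) < len(needed):
        return {'tool': 'retrieve', 'query': needed[len(observations)]}
    return {'answer': 'Based on: ' + ' | '.join(observations)}

def retrieve(query, knowledge_base):
    """Toy retriever: exact lookup in a dict standing in for the index."""
    return knowledge_base.get(query, 'no result')

def agentic_rag(question, knowledge_base, model=scripted_model, max_steps=5):
    """Let the model decide when to retrieve; stop when it answers or runs out of steps."""
    observations = []
    for _ in range(max_steps):
        action = model(question, observations)
        if 'answer' in action:
            return action['answer'], observations
        # The model asked for a lookup: run it and feed the result back
        observations.append(retrieve(action['query'], knowledge_base))
    return 'Stopped: step limit reached.', observations

kb = {
    '2023 revenue report': '[2023 financials chunk]',
    '2024 revenue report': '[2024 financials chunk]',
}
answer, trace = agentic_rag(
    'Compare our 2023 and 2024 revenue and explain the change.', kb)
print(trace)
print(answer)
```

The `max_steps` cap is a deliberate safeguard: a model that keeps asking for retrievals would otherwise loop forever.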
RAG is one of three ways to get a model to use specific knowledge. The others are FINE-TUNING (bake the knowledge into the weights, Chapter 22) and LONG CONTEXT (just put everything in the prompt). Each has its place, and knowing when to use which is an important practical judgment.
| Approach | How knowledge enters | Best when |
|---|---|---|
| RAG | Retrieved into the prompt per query | Large, changing knowledge base |
| Fine-tuning | Baked into the weights | Teaching style/format/skills |
| Long context | All stuffed into the prompt | Small, fixed knowledge that fits |
Why RAG Often Wins for Knowledge
For injecting KNOWLEDGE (facts, documents, data), RAG has decisive advantages over fine-tuning: it updates instantly (just change the index — no retraining), scales to far more knowledge than fits in weights or context, provides citations, and keeps knowledge separate from the model so it's auditable and current. Fine-tuning is better for teaching SKILLS, STYLE, or FORMAT — how to behave — rather than facts to recall.
They Combine
RAG and fine-tuning are not mutually exclusive — the strongest systems often use both: fine-tune the model for the domain's style and the skill of using retrieved context well, AND use RAG to supply current, specific knowledge at query time. And as context windows grow (Chapter 33), the line between RAG and long context shifts — but even with huge contexts, retrieval remains valuable for selecting WHAT to put in the context from a knowledge base far too large to fit entirely.
Let us assemble the whole chapter into a complete, production-minded RAG system, integrating chunking, embedding, indexing, hybrid retrieval, reranking, context assembly, and grounded generation.
Pipeline Flow: A complete RAG system
| 1 | Ingest & chunk | Split documents on natural boundaries, with overlap |
| 2 | Embed & index | Embed chunks; store in a vector DB with metadata |
| 3 | Hybrid retrieve | Dense + keyword search, fused with RRF |
| 4 | Rerank | Cross-encoder picks the best few from the candidates |
| 5 | Assemble | Best chunks first/last; grounding + citation instructions |
| 6 | Generate | Model answers grounded in context, cites sources |
| 7 | Evaluate | Measure retrieval recall and answer faithfulness separately |
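For step 7, retrieval quality can be measured on its own, before any answer is generated. A common metric is recall@k: of the chunks a human marked as relevant for a question, what fraction appears in the top k retrieved? The sketch below assumes you have such labels; the function name and the sample ids are illustrative. Answer faithfulness needs a separate check and is not covered here.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError('relevant_ids must not be empty')
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)

retrieved = [3, 1, 7, 2]   # ranked output of the retriever (illustrative ids)
relevant = [1, 2]          # human-labeled relevant chunks for this question
print(recall_at_k(retrieved, relevant, k=2))   # 0.5: only chunk 1 is in the top 2
print(recall_at_k(retrieved, relevant, k=4))   # 1.0: both are in the top 4
```

Averaging this over a labeled question set, at the k you actually pass to the reranker, shows whether the right chunks are even reaching the later stages.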
class RAGSystem:
    # build_faiss_index, build_bm25, faiss_search, rerank, and model are
    # placeholders for the components sketched earlier in this chapter.
    def __init__(self, documents):
        # OFFLINE: chunk, embed, index
        self.chunks = [c for d in documents for c in chunk_with_overlap(d)]
        self.vecs = embedder.encode(self.chunks, normalize_embeddings=True)
        self.index = build_faiss_index(self.vecs)
        self.bm25 = build_bm25(self.chunks)    # for hybrid search

    def answer(self, query):
        # 1. Hybrid retrieve (dense + keyword), fuse with RRF
        q = embedder.encode(query, normalize_embeddings=True)  # normalized, like the chunks
        dense = faiss_search(self.index, q, k=30)
        keyword = self.bm25.search(query, k=30)
        candidates = reciprocal_rank_fusion(dense, keyword)[:30]
        # 2. Rerank to the best few (rerank is assumed to map ids back to chunk text)
        top = rerank(query, candidates, k_final=5)
        # 3. Assemble grounded prompt and generate
        prompt = build_rag_prompt(query, top)  # cite + 'say I don't know'
        return model.generate(prompt)

# Each stage (chunk, embed, hybrid, rerank, ground) adds robustness.
# Frameworks (LlamaIndex, LangChain) provide this plumbing prebuilt.
RAG Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Why RAG | Ground answers in external knowledge | Open-book exam for models |
| The pipeline | Retrieve, then generate | Retrieval quality dominates |
| Dense retrieval | Match by meaning via embeddings | Finds synonyms/paraphrases |
| Vector DB / FAISS | Fast approximate nearest-neighbor | Scales to millions of vectors |
| Chunking | Split documents into passages | Overlap; natural boundaries |
| Hybrid search | Dense + keyword, fused (RRF) | Best of meaning + exactness |
| Reranking | Precise second pass on candidates | Retrieve wide, rerank narrow |
| Context assembly | Stuff best chunks, instruct to ground | Beware lost-in-the-middle |
| RAG vs fine-tune | Knowledge vs skill | RAG updates instantly, cites |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–22 require code.
Further reading: “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020), the original RAG paper. “Dense Passage Retrieval” (Karpukhin et al., 2020). “Billion-scale similarity search with GPUs” and the FAISS library (Johnson et al., 2017). “HNSW” (Malkov & Yashunin, 2018) for graph-based ANN. “Lost in the Middle” (Liu et al., 2023). “Precise Zero-Shot Dense Retrieval (HyDE)” (Gao et al., 2022). “Self-RAG” (Asai et al., 2023). “Contextual Retrieval” (Anthropic, 2024). The RAGAS framework for RAG evaluation.
Next → Chapter 30: Multi-modal LLMs
So far our models work entirely in text. But the world is not only text — it is images, audio, and video. Chapter 30 extends LLMs to MULTIPLE MODALITIES: models that can SEE images and HEAR audio, not just read text. We will see how images are turned into tokens the model can process (vision encoders and projection), how text and visual information are fused, how these models are trained, and what they can do — from describing images to answering questions about charts and documents. The text-only assistant becomes one that perceives the world more like we do.