Long Context & Memory
Chapter 32 scaled a model's PARAMETERS. This chapter scales something else: its CONTEXT LENGTH — how many tokens it can attend to at once. A model's context window is its working memory: everything it can 'see' while generating the next token. Early models had tiny windows (512–2048 tokens); today's reach hundreds of thousands or even millions. Growing this window is one of the most active frontiers, and it unlocks fundamentally new capabilities.
What Long Context Enables
| Capability | Why it needs long context |
|---|---|
| Whole-book understanding | Hold an entire book in the window at once |
| Large codebase reasoning | Attend to thousands of files together |
| Long document analysis | Reason over full contracts, papers, reports |
| Extended conversations | Remember a long dialogue without forgetting |
| Many-shot prompting | Hundreds of examples in the prompt |
| Agentic trajectories | Long chains of tool calls and observations (Ch. 28) |
| RAG with more retrieved context | Fit more retrieved passages (Ch. 29) |
Context Is the Model's Working Memory
Think of the context window as the model's short-term, working memory — distinct from the long-term knowledge baked into its weights. The weights hold what the model learned during training; the context holds what it is thinking about RIGHT NOW. A bigger context window means the model can hold more of the current problem in mind at once: more of the document, more of the conversation, more of the codebase. It does not make the model smarter, but it lets the model apply its intelligence to more information simultaneously.
There is a fundamental reason long context is hard, and it goes back to attention itself (Chapter 12). Standard attention has a cost that grows QUADRATICALLY with the sequence length — double the context and you quadruple the attention cost. This 'quadratic wall' is the central obstacle to long context, and nearly every technique in this chapter is a way around it.
Why Attention Is Quadratic
Recall how attention works: every token attends to every other token. With T tokens, that is T × T = T² pairwise interactions — the attention matrix is T-by-T. So the compute and memory for attention grow as T². At T=1,000 that is a million interactions; at T=1,000,000 it is a TRILLION. The cost explodes, which is why simply making the window bigger does not scale.
Each of T tokens attends to all T tokens → T × T = T² interactions
T = 1,000 → 1,000,000 interactions
T = 100,000 → 10,000,000,000
T = 1,000,000 → 1,000,000,000,000 (a trillion!)
# Compute AND the attention-matrix memory grow as O(T²).
# Doubling context QUADRUPLES the cost. This is the wall.The KV-Cache Cost Too
There is a second long-context cost, from inference (Chapter 27): the KV cache grows LINEARLY with context length, but at long contexts it becomes enormous — storing keys and values for a million tokens consumes vast memory. So long context strains both attention compute (quadratic) and KV-cache memory (linear but large). Both must be addressed to serve long contexts affordably.
Before we can extend context, we must understand a subtler obstacle: even setting aside compute cost, a model trained on 4K tokens usually FAILS if you simply feed it 8K — it produces garbage. The reason lies in POSITION ENCODINGS, the mechanism (Chapter 13) that tells the model WHERE each token is. Understanding position encodings is the key to extending context.
Why Position Matters
Recall that attention itself is order-blind — it sees a set of tokens, not a sequence. Position encodings inject order information so the model knows token 5 comes before token 50. But position encodings are LEARNED or DEFINED for the positions seen during training. Feed the model a position it never saw (position 6,000 when trained only to 4,096), and the position encoding is out of distribution — the model has no idea what that position means, and breaks.
RoPE: Rotary Position Embeddings
The dominant modern position encoding is RoPE (Rotary Position Embedding; Su et al., 2021). Instead of adding a position vector, RoPE ROTATES the query and key vectors by an angle proportional to their position. The dot product in attention then naturally depends on the RELATIVE distance between tokens — an elegant property. RoPE is used in LLaMA, Mistral, and most modern models, and crucially, its mathematical structure makes it EXTENDABLE in ways that earlier encodings were not.
RoPE rotates query/key vectors by an angle ∝ position:
q_m → rotate(q_m, mθ) # token at position m
k_n → rotate(k_n, nθ) # token at position n
Their dot product then depends on (m - n), the RELATIVE distance.
# Different frequencies θ encode position at different scales.
# The rotation structure is what makes RoPE extendable (next section).Here is a remarkable fact: you can take a model trained on 4K tokens and EXTEND it to 32K, 128K, or more — often with only a little extra fine-tuning — by cleverly manipulating its RoPE position encodings. These context-extension techniques are how most long-context models are actually made, and they all build on RoPE's rotational structure.
Position Interpolation: Squeeze, Don't Extrapolate
The breakthrough idea (Position Interpolation; Chen et al., 2023) is counterintuitive but simple. Instead of EXTRAPOLATING to new positions the model never saw (which fails), INTERPOLATE: squeeze the new, longer range of positions back into the range the model WAS trained on. To handle 8K tokens with a 4K-trained model, scale every position by 1/2, so positions run 0 to 4,096 (in-distribution) instead of 0 to 8,192 (out-of-distribution). The model sees only familiar position values, just spaced more finely.
NTK-Aware Scaling and YaRN
Plain interpolation works but blurs fine positional detail. Better methods scale different RoPE frequencies DIFFERENTLY. NTK-aware scaling interpolates low frequencies (long-range position) more than high frequencies (local position), preserving local precision. YaRN (Yet another RoPE extensioN; Peng et al., 2023) refines this further with a careful per-frequency scaling plus a temperature adjustment, achieving strong context extension with minimal fine-tuning. YaRN is one of the most widely-used extension methods.
| Method | How it extends context |
|---|---|
| Extrapolation (naive) | Just use larger positions — FAILS (out of distribution) |
| Position Interpolation | Linearly squeeze positions into the trained range |
| NTK-aware scaling | Scale low frequencies more than high — preserves local detail |
| YaRN | Per-frequency scaling + temperature — strong, little fine-tuning |
| LongRoPE | Search for optimal per-dimension scaling — reaches 2M+ tokens |
import torch
def interpolate_rope_positions(positions, trained_len, target_len):
"""Squeeze longer positions into the trained range (linear PI)."""
scale = trained_len / target_len # e.g. 4096/8192 = 0.5
# Every position is scaled DOWN so it falls in [0, trained_len]
return positions * scale
# A position-8000 token is treated as position 4000 -- in-distribution.
# Apply this scaling inside RoPE, then briefly fine-tune to adapt.
# YaRN/NTK scale different RoPE frequencies by different amounts instead
# of one uniform factor, preserving local positional precision.Context extension (Section 33.4) handles the POSITION problem, but the quadratic COMPUTE problem remains. The second family of solutions attacks attention directly: make it cheaper than O(T²). These efficient-attention methods either compute exact attention more cleverly, or approximate it by having each token attend to FEWER others.
FlashAttention: Same Math, Less Memory
First, a clarification: FlashAttention (Chapter 20) does NOT change attention's quadratic compute — it computes the EXACT same attention, but far more memory-efficiently by never materializing the full T×T matrix. It tiles the computation to keep data in fast on-chip memory. FlashAttention made long context practical by removing the MEMORY blowup of the attention matrix, even though the compute stays quadratic. It is the foundation that most long-context models build on.
Sparse Attention: Attend to Fewer Tokens
To beat the quadratic COMPUTE, sparse attention has each token attend to only a SUBSET of other tokens, not all of them. If each token attends to a fixed-size window or a sparse pattern instead of everything, the cost drops from O(T²) toward O(T). The art is choosing WHICH tokens to attend to so that little important information is lost.
| Pattern | What each token attends to |
|---|---|
| Sliding window | A fixed window of nearby tokens (local attention) |
| Dilated / strided | Every n-th token, to cover long range cheaply |
| Global + local | Most tokens attend locally; a few 'global' tokens attend to all |
| Block-sparse | Attention restricted to blocks (e.g. BigBird, Longformer) |
| Streaming / sink | Recent window + a few 'attention sink' tokens at the start |
A common and effective pattern combines LOCAL attention (each token attends to a nearby window) with a few GLOBAL tokens (that attend to and are attended by everything). Local attention captures nearby detail cheaply; the global tokens carry long-range information. This 'global + local' design (used in Longformer, BigBird) approximates full attention at a fraction of the cost.
Sparse attention skips tokens; LINEAR attention takes a different route — it mathematically reformulates attention so its cost grows LINEARLY with sequence length instead of quadratically, while still letting every token influence every other. This is achieved by changing the ORDER of the attention computation using a kernel approximation.
The Reordering Trick
Standard attention computes softmax(QKᵀ)V — the QKᵀ forms the T×T matrix (quadratic). Linear attention approximates the softmax with a kernel feature map, which lets you reorder the computation to (Q')(K'ᵀV) — computing K'ᵀV first. That inner product is a small fixed-size matrix (dimension × dimension), independent of T. The result: cost grows linearly with T. It is the same idea as choosing a cheaper order of matrix multiplication (Chapter 1), applied to attention.
Standard: softmax(Q Kᵀ) V — the Q Kᵀ is T × T (quadratic)
Linear: φ(Q) (φ(K)ᵀ V) — compute φ(K)ᵀ V first
φ(K)ᵀ V is d × d, INDEPENDENT of T
# φ = a kernel feature map approximating softmax.
# Cost becomes O(T) instead of O(T²). Can also run as a recurrence.The Recurrent View
Linear attention has a beautiful property: it can be written as a RECURRENCE — maintaining a fixed-size 'state' that is updated one token at a time, like an RNN (Chapter 9). This means linear-attention models can process sequences of UNBOUNDED length with constant memory per step, since the state size doesn't grow with context. This recurrent view is the bridge to state-space models, the most successful linear-time architecture, which we turn to next.
State-space models (SSMs), and especially MAMBA (Gu & Dao, 2023), are the most prominent attempt to replace attention with something that scales linearly. They have generated enormous excitement as a potential successor or complement to the Transformer for long sequences. Let us understand them at a beginner level.
The Core Idea: A Compressed Running State
An SSM processes a sequence by maintaining a fixed-size STATE that it updates as it reads each token — like an RNN, but based on the mathematics of continuous state-space systems from control theory. The state is a compressed summary of everything seen so far. Because the state is fixed-size, processing each token costs the same regardless of how long the sequence is — LINEAR total cost and CONSTANT memory per step, no quadratic wall, no growing KV cache.
Arch Stack: A state-space model: a fixed-size running state
| output token | from current state |
| fixed-size STATE | compressed summary of the past |
| update rule | state = A·state + B·input |
| input token | (d,) |
Mamba's Innovation: Selectivity
Classical SSMs had a weakness: their state update was FIXED — the same for every token — so they couldn't decide what to remember or forget based on content. Mamba's key innovation is SELECTIVITY: it makes the state update DEPEND ON THE INPUT, so the model can choose to remember important tokens and forget irrelevant ones, dynamically. This 'selective state space' closed much of the quality gap with attention while keeping linear cost — the breakthrough that made SSMs competitive.
| Transformer attention | Mamba (selective SSM) |
|---|---|
| O(T²) compute | O(T) compute |
| KV cache grows with T | Fixed-size state |
| Every token attends to all | Compressed running state |
| Random access to any past token | Must compress past into state |
| Excellent recall of any detail | May lose specific distant details |
| The dominant architecture | Promising challenger / hybrid |
The Trade-off: Compression vs Random Access
The fundamental trade-off: attention keeps EVERY past token available and can attend to any of them precisely (perfect recall, but quadratic cost and growing cache). An SSM compresses the past into a fixed-size state (cheap and constant memory, but the compression can lose specific details). For tasks needing exact recall of a distant token ('what was the 3rd word?'), attention's random access wins; for tasks needing a general summary of a long sequence, the SSM's efficient state can suffice. This trade-off is central to choosing between them.
All the methods so far try to make the CONTEXT WINDOW itself bigger or cheaper. A completely different strategy: don't put everything in context at all. Instead, store information in an EXTERNAL MEMORY and bring relevant pieces INTO context only when needed — like a computer's memory hierarchy. This connects long context to the RAG of Chapter 29 and offers a path to effectively UNBOUNDED memory.
The Memory-Hierarchy Idea
MemGPT (Packer et al., 2023) drew an explicit analogy to operating systems. Just as an OS manages a small fast RAM and a large slow disk — paging data between them as needed — a model can manage a small fast CONTEXT WINDOW and a large external MEMORY store. The model decides what to keep in its limited context and what to 'page out' to external memory, retrieving things back when relevant. This gives the model effectively unbounded memory despite a finite context window.
Tool Trace: MemGPT-style memory management
| User | References something from 1000 messages ago | → |
| Model | Realizes the detail isn't in its current context | • |
| Memory | Model issues a retrieval call to external memory | → |
| Memory | Returns the relevant past information | ← |
| Model | Loads it into context and answers using it | ← |
The Connection to RAG
External memory is essentially RAG (Chapter 29) applied to the model's own history and working memory. The information is chunked, embedded, and stored; when needed, it is retrieved and inserted into the context. The difference is emphasis: RAG retrieves from a knowledge base to answer questions; external memory retrieves from the conversation/task history to extend the model's effective memory. Both rest on the same retrieve-into-context principle.
Models that handle a million or more tokens are real today. They are not achieved by any single trick but by COMBINING the techniques of this chapter. Let us see how the pieces fit together to scale context to extreme lengths.
Pipeline Flow: How 1M+ token contexts are achieved
| 1 | Train short | Pretrain on a manageable context (e.g. 4K–8K) |
| 2 | Extend positions | RoPE scaling / YaRN to reach long context cheaply |
| 3 | Fine-tune long | Brief fine-tuning on long sequences to adapt |
| 4 | Efficient attention | FlashAttention + sparse/linear patterns for the compute |
| 5 | Manage KV cache | Quantize/compress the cache; paging (Ch. 27) |
| 6 | (Maybe) hybrid/SSM | SSM layers or external memory for extreme lengths |
The Combined Recipe
A modern long-context model typically: starts from a short-context pretrained base; extends positions with YaRN or similar; fine-tunes briefly on long data; uses FlashAttention (and often sparse patterns) to make the attention compute feasible; and manages the enormous KV cache with quantization and paging. For the very longest contexts, it may add SSM/hybrid layers or external memory. No single technique reaches 1M tokens — it is the stack of them, each addressing a different part of the problem (position, compute, memory).
| Obstacle | Addressed by |
|---|---|
| Positions out of distribution | RoPE scaling / YaRN + fine-tuning |
| Quadratic attention compute | FlashAttention, sparse/linear attention, SSMs |
| Enormous KV cache | KV-cache quantization, paging, GQA (Ch. 19, 27) |
| Quality degradation | Long-context fine-tuning, careful evaluation |
| Unbounded information | External memory / RAG |
A model that ACCEPTS a million tokens is not the same as a model that USES a million tokens well. A crucial, sobering reality: long-context models often fail to effectively use all the context they can technically accept. Understanding these limitations is essential to using long context wisely.
Lost in the Middle
Recall the 'lost in the middle' finding from Chapter 29 (Liu et al., 2023): models attend best to information at the BEGINNING and END of the context, and worst to the MIDDLE. In a long context, a crucial fact buried in the middle may be effectively ignored. So a model with a 1M-token window might still 'lose' information placed at token 500,000. A big window does not guarantee uniform attention across it.
The Needle in a Haystack Test
The standard long-context evaluation is the 'needle in a haystack' test: hide a specific fact (the needle) at various positions in a long context (the haystack), then ask the model to retrieve it. This measures whether the model can actually find information anywhere in its window. Results vary widely — some models ace it, others degrade badly at certain positions or lengths. Passing this test is necessary but not sufficient: real tasks need REASONING over long context, not just retrieval of one fact.
Evaluation Is Hard
Evaluating long context well is itself an open problem. Needle-in-a-haystack tests retrieval but not reasoning. Benchmarks that require synthesizing information ACROSS a long context (multi-hop reasoning, summarizing a whole book, tracking many entities) are harder to build and where models struggle most. As context windows grow, building evaluations that genuinely test long-context REASONING — not just retrieval — is an active research need.
Long context and RAG (Chapter 29) are competing-and-complementary answers to the same question: how does a model use information beyond its weights? As context windows grow, this relationship is actively shifting, and understanding the trade-off is a key practical judgment.
| Long context | RAG |
|---|---|
| Put everything in the window | Retrieve only relevant pieces |
| Simple — no retrieval system | Needs an index + retriever |
| Bounded by window size | Scales to unbounded knowledge |
| Expensive at full length | Cheaper — small context |
| Model sees all of it (if it can) | Model sees only what's retrieved |
| Lost-in-the-middle risk | Retrieval-quality risk |
When to Use Which
Use LONG CONTEXT when the relevant information is bounded and fits the window, when you want simplicity (no retrieval system to build), or when the task needs the model to see ALL the information together (e.g. reasoning over a whole document). Use RAG when the knowledge base is huge or constantly changing, when cost matters (a small retrieved context is far cheaper than a giant one), or when you want citations. They are not mutually exclusive — many systems retrieve a focused set of documents AND use a long context to hold them.
Let us consolidate the chapter into one map of how long context is achieved and what to keep in mind.
The Three Families of Solutions
| Family | Approach | Examples |
|---|---|---|
| Extend the window | Make positions work longer | RoPE scaling, YaRN |
| Cheapen attention | Beat or sidestep O(T²) | FlashAttn, sparse, linear, Mamba |
| Retrieve into context | Don't hold it all | External memory, RAG |
The Three Ideas to Remember
If you remember three things about long context: First, the QUADRATIC WALL is the core obstacle — attention costs O(T²), so naive long context is infeasible, and every technique is a way around it. Second, POSITION EXTENSION (RoPE scaling, YaRN) lets short-trained models reach long contexts cheaply, while EFFICIENT ATTENTION and SSMs (Mamba) attack the compute. Third, a BIG WINDOW IS NOT FREE CAPABILITY — models may lose information in the middle, reason poorly over long contexts, and cost a lot, so evaluate on your real task and consider RAG as a complement.
Long-Context Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Why long context | More working memory | Holds books, codebases, dialogs |
| Quadratic wall | Attention costs O(T²) | Doubling context 4x's the cost |
| Position encodings | Models break past trained length | RoPE enables extension |
| RoPE scaling / YaRN | Interpolate, don't extrapolate | Extend short models cheaply |
| Efficient attention | FlashAttn (exact), sparse (approx) | Cheaper attention compute |
| Linear attention | Reorder to O(T); recurrent view | RNN ideas reborn |
| Mamba / SSMs | Fixed-size selective state | O(T), but compresses the past |
| External memory | Retrieve into context (MemGPT) | Unbounded, RAG-like |
| Lost in the middle | Big window != uses it well | Evaluate on your real task |
Exercises
Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.
Further reading: “RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021, RoPE). “Extending Context Window via Position Interpolation” (Chen et al., 2023). “YaRN: Efficient Context Window Extension” (Peng et al., 2023). “LongRoPE” (Ding et al., 2024). “Longformer” (Beltagy et al., 2020) and “Big Bird” (Zaheer et al., 2020) for sparse attention. “Transformers are RNNs” (Katharopoulos et al., 2020) for linear attention. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” (Gu & Dao, 2023). “MemGPT” (Packer et al., 2023). “Lost in the Middle” (Liu et al., 2023).
Next → Chapter 34: Agents & Multi-Agent Systems
You now have models with vast capacity (Chapter 32) and long memory (this chapter), able to use tools (Chapter 28), retrieve knowledge (Chapter 29), and reason (Chapter 25). Chapter 34 brings these together into AGENTS — systems that pursue goals autonomously over many steps — and MULTI-AGENT systems, where several agents collaborate, debate, or divide labor to solve problems no single call could. We will explore planning, agent memory, the orchestration of multiple specialized agents, and the emergent behaviours and failure modes of agentic systems. The capabilities of the whole book converge into autonomous, goal-directed AI.