Part VII: Frontier Techniques & Future

Chapter 33

Long Context & Memory

Extending how much a model can attend to: attention's quadratic wall, RoPE extrapolation and YaRN, efficient attention, state-space models like Mamba, and external memory — reaching 1M+ token contexts.

20 Exercises

Learning Objectives

1.	Explain why long context matters and what it enables.
2.	Understand attention's quadratic cost — the core obstacle to long context.
3.	Understand position encodings and why models fail beyond their trained length.
4.	Apply RoPE and context-extension methods (position interpolation, YaRN).
5.	Understand efficient and sparse attention variants.
6.	Understand state-space models (Mamba) and linear-time sequence modeling.
7.	Understand external memory approaches like MemGPT.
8.	Reason about the 'lost in the middle' and long-context evaluation problems.
9.	Compare strategies for reaching 1M+ token contexts.
10.	Relate long context to RAG and the broader memory question.

Chapter 32 scaled a model's PARAMETERS. This chapter scales something else: its CONTEXT LENGTH — how many tokens it can attend to at once. A model's context window is its working memory: everything it can 'see' while generating the next token. Early models had tiny windows (512–2048 tokens); today's reach hundreds of thousands or even millions. Growing this window is one of the most active frontiers, and it unlocks fundamentally new capabilities.

What Long Context Enables

Capability	Why it needs long context
Whole-book understanding	Hold an entire book in the window at once
Large codebase reasoning	Attend to thousands of files together
Long document analysis	Reason over full contracts, papers, reports
Extended conversations	Remember a long dialogue without forgetting
Many-shot prompting	Hundreds of examples in the prompt
Agentic trajectories	Long chains of tool calls and observations (Ch. 28)
RAG with more retrieved context	Fit more retrieved passages (Ch. 29)

Context Is the Model's Working Memory

Think of the context window as the model's short-term, working memory — distinct from the long-term knowledge baked into its weights. The weights hold what the model learned during training; the context holds what it is thinking about RIGHT NOW. A bigger context window means the model can hold more of the current problem in mind at once: more of the document, more of the conversation, more of the codebase. It does not make the model smarter, but it lets the model apply its intelligence to more information simultaneously.

✧

Context Note: Long Context Changes What's Possible

The jump from a 2K to a 1M token context is not just 'more of the same' — it is qualitatively transformative. At 2K tokens you can hold a few pages; at 1M you can hold entire books, codebases, or hours of transcripts. Whole classes of tasks that required chunking, retrieval, and clever workarounds become possible by just putting everything in the context. Long context is a capability multiplier, which is why so much effort goes into extending it.

But — as we will see — long context is expensive and imperfect. The naive approach (just make the window bigger) runs into a hard computational wall (Section 33.2) and quality problems (Section 33.10). The rest of this chapter is about the clever techniques that make long context actually work, and about its limits and alternatives.

There is a fundamental reason long context is hard, and it goes back to attention itself (Chapter 12). Standard attention has a cost that grows QUADRATICALLY with the sequence length — double the context and you quadruple the attention cost. This 'quadratic wall' is the central obstacle to long context, and nearly every technique in this chapter is a way around it.

Why Attention Is Quadratic

Recall how attention works: every token attends to every other token. With T tokens, that is T × T = T² pairwise interactions — the attention matrix is T-by-T. So the compute and memory for attention grow as T². At T=1,000 that is a million interactions; at T=1,000,000 it is a TRILLION. The cost explodes, which is why simply making the window bigger does not scale.

text•Attention's quadratic cost
Each of T tokens attends to all T tokens → T × T = T² interactions

  T = 1,000      → 1,000,000 interactions
  T = 100,000    → 10,000,000,000
  T = 1,000,000  → 1,000,000,000,000  (a trillion!)

# Compute AND the attention-matrix memory grow as O(T²).
# Doubling context QUADRUPLES the cost. This is the wall.

✧

Intuition: Why Quadratic Growth Is So Punishing

Quadratic growth feels deceptively mild at small scales but becomes brutal at large ones. Going from 1K to 2K tokens quadruples the cost — annoying but manageable. Going from 100K to 200K quadruples a cost that was already enormous. This is why context length couldn't just be 'turned up': each doubling costs four times as much, so reaching 1M tokens naively would cost a million times more than 1K. The wall gets steeper the higher you climb.

There are two ways around the wall: make attention CHEAPER (efficient/sparse/linear attention, Sections 33.5–33.6), or change the ARCHITECTURE so cost grows only linearly (state-space models, Section 33.7). Plus a third path: don't put everything in context at all (external memory and RAG, Section 33.8). The chapter explores all three.

The KV-Cache Cost Too

There is a second long-context cost, from inference (Chapter 27): the KV cache grows LINEARLY with context length, but at long contexts it becomes enormous — storing keys and values for a million tokens consumes vast memory. So long context strains both attention compute (quadratic) and KV-cache memory (linear but large). Both must be addressed to serve long contexts affordably.

Before we can extend context, we must understand a subtler obstacle: even setting aside compute cost, a model trained on 4K tokens usually FAILS if you simply feed it 8K — it produces garbage. The reason lies in POSITION ENCODINGS, the mechanism (Chapter 13) that tells the model WHERE each token is. Understanding position encodings is the key to extending context.

Why Position Matters

Recall that attention itself is order-blind — it sees a set of tokens, not a sequence. Position encodings inject order information so the model knows token 5 comes before token 50. But position encodings are LEARNED or DEFINED for the positions seen during training. Feed the model a position it never saw (position 6,000 when trained only to 4,096), and the position encoding is out of distribution — the model has no idea what that position means, and breaks.

Position encoding

The mechanism that gives a Transformer information about token order, since attention alone is order-blind. The encoding's behaviour at positions beyond the training length is the main obstacle to extending context.

RoPE: Rotary Position Embeddings

The dominant modern position encoding is RoPE (Rotary Position Embedding; Su et al., 2021). Instead of adding a position vector, RoPE ROTATES the query and key vectors by an angle proportional to their position. The dot product in attention then naturally depends on the RELATIVE distance between tokens — an elegant property. RoPE is used in LLaMA, Mistral, and most modern models, and crucially, its mathematical structure makes it EXTENDABLE in ways that earlier encodings were not.

text•RoPE: rotation by position
RoPE rotates query/key vectors by an angle ∝ position:
    q_m → rotate(q_m, mθ)      # token at position m
    k_n → rotate(k_n, nθ)      # token at position n

Their dot product then depends on (m - n), the RELATIVE distance.
# Different frequencies θ encode position at different scales.
# The rotation structure is what makes RoPE extendable (next section).

✧

Context Note: Why RoPE Enabled the Long-Context Era

RoPE's rotational structure turned out to be a gift for extending context. Because position is encoded as a smooth rotation (an angle), you can mathematically 'stretch' or 'reinterpret' those angles to cover longer contexts than trained — something impossible with the older absolute position embeddings. Nearly all the context-extension methods of the next section (position interpolation, NTK scaling, YaRN) exploit RoPE's rotational form.

This is a recurring theme: a design choice made for one reason (RoPE's clean relative-position property) turns out to enable something unforeseen (cheap context extension). Understanding RoPE deeply is the key to understanding how models reach contexts far beyond their training length.

Here is a remarkable fact: you can take a model trained on 4K tokens and EXTEND it to 32K, 128K, or more — often with only a little extra fine-tuning — by cleverly manipulating its RoPE position encodings. These context-extension techniques are how most long-context models are actually made, and they all build on RoPE's rotational structure.

Position Interpolation: Squeeze, Don't Extrapolate

The breakthrough idea (Position Interpolation; Chen et al., 2023) is counterintuitive but simple. Instead of EXTRAPOLATING to new positions the model never saw (which fails), INTERPOLATE: squeeze the new, longer range of positions back into the range the model WAS trained on. To handle 8K tokens with a 4K-trained model, scale every position by 1/2, so positions run 0 to 4,096 (in-distribution) instead of 0 to 8,192 (out-of-distribution). The model sees only familiar position values, just spaced more finely.

✧

Intuition: Stretching a Ruler

Imagine a ruler marked 0 to 4096, and you need to measure something 8192 units long. Extrapolation says 'keep going past the end of the ruler' — but there are no markings there, so you're lost. Interpolation says 'reinterpret the ruler so its 4096 marks now span 8192 units' — each mark covers twice the distance, but every measurement still falls within the ruler's familiar range. The model only ever sees positions it understands; they just mean something slightly different.

This is why interpolation works where extrapolation fails: the model never encounters an unfamiliar position value. The cost is slightly reduced positional precision (the marks are spaced more finely), which a little fine-tuning recovers. This single insight unlocked practical context extension.

NTK-Aware Scaling and YaRN

Plain interpolation works but blurs fine positional detail. Better methods scale different RoPE frequencies DIFFERENTLY. NTK-aware scaling interpolates low frequencies (long-range position) more than high frequencies (local position), preserving local precision. YaRN (Yet another RoPE extensioN; Peng et al., 2023) refines this further with a careful per-frequency scaling plus a temperature adjustment, achieving strong context extension with minimal fine-tuning. YaRN is one of the most widely-used extension methods.

Method	How it extends context
Extrapolation (naive)	Just use larger positions — FAILS (out of distribution)
Position Interpolation	Linearly squeeze positions into the trained range
NTK-aware scaling	Scale low frequencies more than high — preserves local detail
YaRN	Per-frequency scaling + temperature — strong, little fine-tuning
LongRoPE	Search for optimal per-dimension scaling — reaches 2M+ tokens

Python•Position interpolation (the core idea)
import torch

def interpolate_rope_positions(positions, trained_len, target_len):
    """Squeeze longer positions into the trained range (linear PI)."""
    scale = trained_len / target_len      # e.g. 4096/8192 = 0.5
    # Every position is scaled DOWN so it falls in [0, trained_len]
    return positions * scale

# A position-8000 token is treated as position 4000 -- in-distribution.
# Apply this scaling inside RoPE, then briefly fine-tune to adapt.
# YaRN/NTK scale different RoPE frequencies by different amounts instead
# of one uniform factor, preserving local positional precision.

✧

Context Note: Extend, Don't Retrain

The practical magic of these methods: you do not have to retrain a long-context model from scratch (which would be astronomically expensive at the quadratic cost). You take an existing model trained on short context and EXTEND it with position interpolation + a modest fine-tune on some longer sequences. This is how most long-context models are made — a short-context base, extended cheaply. It is one of the most cost-effective tricks at the frontier.

This connects to the chapter's theme: the quadratic wall makes training long-context from scratch impractical, so the field found a workaround — train short, extend smart. RoPE's structure is what makes this possible, which is why we spent a section understanding it.

Context extension (Section 33.4) handles the POSITION problem, but the quadratic COMPUTE problem remains. The second family of solutions attacks attention directly: make it cheaper than O(T²). These efficient-attention methods either compute exact attention more cleverly, or approximate it by having each token attend to FEWER others.

FlashAttention: Same Math, Less Memory

First, a clarification: FlashAttention (Chapter 20) does NOT change attention's quadratic compute — it computes the EXACT same attention, but far more memory-efficiently by never materializing the full T×T matrix. It tiles the computation to keep data in fast on-chip memory. FlashAttention made long context practical by removing the MEMORY blowup of the attention matrix, even though the compute stays quadratic. It is the foundation that most long-context models build on.

Sparse Attention: Attend to Fewer Tokens

To beat the quadratic COMPUTE, sparse attention has each token attend to only a SUBSET of other tokens, not all of them. If each token attends to a fixed-size window or a sparse pattern instead of everything, the cost drops from O(T²) toward O(T). The art is choosing WHICH tokens to attend to so that little important information is lost.

Pattern	What each token attends to
Sliding window	A fixed window of nearby tokens (local attention)
Dilated / strided	Every n-th token, to cover long range cheaply
Global + local	Most tokens attend locally; a few 'global' tokens attend to all
Block-sparse	Attention restricted to blocks (e.g. BigBird, Longformer)
Streaming / sink	Recent window + a few 'attention sink' tokens at the start

A common and effective pattern combines LOCAL attention (each token attends to a nearby window) with a few GLOBAL tokens (that attend to and are attended by everything). Local attention captures nearby detail cheaply; the global tokens carry long-range information. This 'global + local' design (used in Longformer, BigBird) approximates full attention at a fraction of the cost.

✧

Context Note: Sparse Attention Trades Completeness for Speed

Sparse attention is an APPROXIMATION: by attending to fewer tokens, a token might miss a relevant one far away. The bet is that most important information is either nearby (captured by local attention) or routed through global tokens. For many tasks this bet pays off, but sparse attention can miss long-range dependencies that full attention would catch. It is a classic speed-vs-completeness trade-off.

This is why the field also pursued a more radical path: instead of approximating attention, REPLACE it with an architecture that is linear by design. That path leads to state-space models like Mamba (Section 33.7), which abandon attention's all-pairs interaction entirely — a deeper rethinking of how to model sequences.

Sparse attention skips tokens; LINEAR attention takes a different route — it mathematically reformulates attention so its cost grows LINEARLY with sequence length instead of quadratically, while still letting every token influence every other. This is achieved by changing the ORDER of the attention computation using a kernel approximation.

The Reordering Trick

Standard attention computes softmax(QKᵀ)V — the QKᵀ forms the T×T matrix (quadratic). Linear attention approximates the softmax with a kernel feature map, which lets you reorder the computation to (Q')(K'ᵀV) — computing K'ᵀV first. That inner product is a small fixed-size matrix (dimension × dimension), independent of T. The result: cost grows linearly with T. It is the same idea as choosing a cheaper order of matrix multiplication (Chapter 1), applied to attention.

text•Linear attention: reorder the computation
Standard:  softmax(Q Kᵀ) V    — the Q Kᵀ is T × T  (quadratic)

Linear:    φ(Q) (φ(K)ᵀ V)     — compute φ(K)ᵀ V first
           φ(K)ᵀ V is d × d, INDEPENDENT of T

# φ = a kernel feature map approximating softmax.
# Cost becomes O(T) instead of O(T²). Can also run as a recurrence.

The Recurrent View

Linear attention has a beautiful property: it can be written as a RECURRENCE — maintaining a fixed-size 'state' that is updated one token at a time, like an RNN (Chapter 9). This means linear-attention models can process sequences of UNBOUNDED length with constant memory per step, since the state size doesn't grow with context. This recurrent view is the bridge to state-space models, the most successful linear-time architecture, which we turn to next.

✧

Context Note: The RNN Comes Full Circle

There is a striking historical loop here. Part III moved from RNNs (which process sequences step by step with a fixed-size state, but were slow to train and forgot long-range information) to Transformers (which attend to everything in parallel, but cost O(T²)). Now, to handle very long contexts, the field is returning to RNN-like RECURRENCE — fixed-size state, linear cost — but with modern designs that fix the old RNNs' weaknesses. The pendulum swings back, smarter.

Linear attention and state-space models are, in a sense, RNNs reborn: they recover the RNN's efficiency (linear time, constant memory) while addressing why RNNs lost to Transformers. Whether they can fully match Transformer quality is one of the open questions of this chapter — and the field.

State-space models (SSMs), and especially MAMBA (Gu & Dao, 2023), are the most prominent attempt to replace attention with something that scales linearly. They have generated enormous excitement as a potential successor or complement to the Transformer for long sequences. Let us understand them at a beginner level.

The Core Idea: A Compressed Running State

An SSM processes a sequence by maintaining a fixed-size STATE that it updates as it reads each token — like an RNN, but based on the mathematics of continuous state-space systems from control theory. The state is a compressed summary of everything seen so far. Because the state is fixed-size, processing each token costs the same regardless of how long the sequence is — LINEAR total cost and CONSTANT memory per step, no quadratic wall, no growing KV cache.

Arch Stack: A state-space model: a fixed-size running state

output token	from current state
fixed-size STATE	compressed summary of the past
update rule	state = A·state + B·input
input token	(d,)

Mamba's Innovation: Selectivity

Classical SSMs had a weakness: their state update was FIXED — the same for every token — so they couldn't decide what to remember or forget based on content. Mamba's key innovation is SELECTIVITY: it makes the state update DEPEND ON THE INPUT, so the model can choose to remember important tokens and forget irrelevant ones, dynamically. This 'selective state space' closed much of the quality gap with attention while keeping linear cost — the breakthrough that made SSMs competitive.

Transformer attention	Mamba (selective SSM)
O(T²) compute	O(T) compute
KV cache grows with T	Fixed-size state
Every token attends to all	Compressed running state
Random access to any past token	Must compress past into state
Excellent recall of any detail	May lose specific distant details
The dominant architecture	Promising challenger / hybrid

The Trade-off: Compression vs Random Access

The fundamental trade-off: attention keeps EVERY past token available and can attend to any of them precisely (perfect recall, but quadratic cost and growing cache). An SSM compresses the past into a fixed-size state (cheap and constant memory, but the compression can lose specific details). For tasks needing exact recall of a distant token ('what was the 3rd word?'), attention's random access wins; for tasks needing a general summary of a long sequence, the SSM's efficient state can suffice. This trade-off is central to choosing between them.

✧

Context Note: Hybrid Models: Best of Both

Because attention and SSMs have complementary strengths — attention's precise recall vs SSM's linear efficiency — many recent models are HYBRIDS, interleaving attention layers and SSM (Mamba) layers. The SSM layers handle the bulk of sequence processing cheaply, while a few attention layers provide the precise recall that pure SSMs lack. Models like Jamba take this approach, getting much of attention's quality at much of the SSM's efficiency.

This pragmatic hybridization is typical of how the frontier evolves: rather than one architecture wholly replacing another, the strengths of each get combined. Whether pure SSMs, hybrids, or improved attention will dominate long-context modeling is genuinely open — an active and exciting research question.

All the methods so far try to make the CONTEXT WINDOW itself bigger or cheaper. A completely different strategy: don't put everything in context at all. Instead, store information in an EXTERNAL MEMORY and bring relevant pieces INTO context only when needed — like a computer's memory hierarchy. This connects long context to the RAG of Chapter 29 and offers a path to effectively UNBOUNDED memory.

The Memory-Hierarchy Idea

MemGPT (Packer et al., 2023) drew an explicit analogy to operating systems. Just as an OS manages a small fast RAM and a large slow disk — paging data between them as needed — a model can manage a small fast CONTEXT WINDOW and a large external MEMORY store. The model decides what to keep in its limited context and what to 'page out' to external memory, retrieving things back when relevant. This gives the model effectively unbounded memory despite a finite context window.

External memory

A store outside the model's context window, holding information that is retrieved into context only when relevant — giving effectively unbounded memory with a finite window, analogous to an OS paging between RAM and disk.

Tool Trace: MemGPT-style memory management

User	References something from 1000 messages ago	→
Model	Realizes the detail isn't in its current context	•
Memory	Model issues a retrieval call to external memory	→
Memory	Returns the relevant past information	←
Model	Loads it into context and answers using it	←

The Connection to RAG

External memory is essentially RAG (Chapter 29) applied to the model's own history and working memory. The information is chunked, embedded, and stored; when needed, it is retrieved and inserted into the context. The difference is emphasis: RAG retrieves from a knowledge base to answer questions; external memory retrieves from the conversation/task history to extend the model's effective memory. Both rest on the same retrieve-into-context principle.

✧

Context Note: Two Philosophies of Long Context

There are fundamentally two philosophies. The FIT-IT-ALL-IN approach (bigger windows, efficient attention, SSMs) tries to let the model attend to everything directly — simple to use, but expensive and bounded. The RETRIEVE-WHAT'S-NEEDED approach (external memory, RAG) keeps the context small and pulls in only relevant pieces — scalable to unbounded information, but dependent on retrieval quality. They are complementary, and the best systems often combine them.

As context windows grow, the line between them blurs: with a 1M-token window you can fit a lot directly, reducing the need to retrieve. But no window is infinite, and retrieval scales to information that no window could hold. The two approaches will likely coexist, with the balance shifting as windows grow and retrieval improves.

Models that handle a million or more tokens are real today. They are not achieved by any single trick but by COMBINING the techniques of this chapter. Let us see how the pieces fit together to scale context to extreme lengths.

Pipeline Flow: How 1M+ token contexts are achieved

1	Train short	Pretrain on a manageable context (e.g. 4K–8K)
2	Extend positions	RoPE scaling / YaRN to reach long context cheaply
3	Fine-tune long	Brief fine-tuning on long sequences to adapt
4	Efficient attention	FlashAttention + sparse/linear patterns for the compute
5	Manage KV cache	Quantize/compress the cache; paging (Ch. 27)
6	(Maybe) hybrid/SSM	SSM layers or external memory for extreme lengths

The Combined Recipe

A modern long-context model typically: starts from a short-context pretrained base; extends positions with YaRN or similar; fine-tunes briefly on long data; uses FlashAttention (and often sparse patterns) to make the attention compute feasible; and manages the enormous KV cache with quantization and paging. For the very longest contexts, it may add SSM/hybrid layers or external memory. No single technique reaches 1M tokens — it is the stack of them, each addressing a different part of the problem (position, compute, memory).

Obstacle	Addressed by
Positions out of distribution	RoPE scaling / YaRN + fine-tuning
Quadratic attention compute	FlashAttention, sparse/linear attention, SSMs
Enormous KV cache	KV-cache quantization, paging, GQA (Ch. 19, 27)
Quality degradation	Long-context fine-tuning, careful evaluation
Unbounded information	External memory / RAG

✧

Context Note: Long Context Is a Systems Achievement

Reaching 1M tokens is, like much of Part VI–VII, a SYSTEMS achievement as much as an algorithmic one. It requires position extension, efficient attention kernels, KV-cache management, distributed serving, and careful fine-tuning — all working together. The headline number ('1M token context!') hides a stack of coordinated engineering. This is the nature of the frontier: capabilities emerge from integrating many techniques, not from one breakthrough.

And as the next section shows, a large context WINDOW does not automatically mean the model USES all of it well. Reaching 1M tokens is necessary but not sufficient; the model must also effectively attend to and reason over that much context — a separate and harder problem.

A model that ACCEPTS a million tokens is not the same as a model that USES a million tokens well. A crucial, sobering reality: long-context models often fail to effectively use all the context they can technically accept. Understanding these limitations is essential to using long context wisely.

Lost in the Middle

Recall the 'lost in the middle' finding from Chapter 29 (Liu et al., 2023): models attend best to information at the BEGINNING and END of the context, and worst to the MIDDLE. In a long context, a crucial fact buried in the middle may be effectively ignored. So a model with a 1M-token window might still 'lose' information placed at token 500,000. A big window does not guarantee uniform attention across it.

The Needle in a Haystack Test

The standard long-context evaluation is the 'needle in a haystack' test: hide a specific fact (the needle) at various positions in a long context (the haystack), then ask the model to retrieve it. This measures whether the model can actually find information anywhere in its window. Results vary widely — some models ace it, others degrade badly at certain positions or lengths. Passing this test is necessary but not sufficient: real tasks need REASONING over long context, not just retrieval of one fact.

⚠️

A Big Window Is Not Free Capability

Beware the marketing-vs-reality gap in long context. A model advertised with a 1M-token window may: degrade in quality as the context fills, ignore information in the middle, reason poorly over the full context (even if it can retrieve single facts), and cost enormously to run at full length. The advertised window is the MAXIMUM the model accepts, not a guarantee it uses that much effectively. Always evaluate long-context performance on YOUR actual task, not just the headline number.

This echoes the 'more context is not always better' lesson from RAG (Chapter 29). Often, a focused, well-curated context outperforms a giant one stuffed with marginally-relevant information — both because of lost-in-the-middle effects and because irrelevant context distracts the model. Use the long window when it helps; don't assume bigger is always better.

Evaluation Is Hard

Evaluating long context well is itself an open problem. Needle-in-a-haystack tests retrieval but not reasoning. Benchmarks that require synthesizing information ACROSS a long context (multi-hop reasoning, summarizing a whole book, tracking many entities) are harder to build and where models struggle most. As context windows grow, building evaluations that genuinely test long-context REASONING — not just retrieval — is an active research need.

Long context and RAG (Chapter 29) are competing-and-complementary answers to the same question: how does a model use information beyond its weights? As context windows grow, this relationship is actively shifting, and understanding the trade-off is a key practical judgment.

Long context	RAG
Put everything in the window	Retrieve only relevant pieces
Simple — no retrieval system	Needs an index + retriever
Bounded by window size	Scales to unbounded knowledge
Expensive at full length	Cheaper — small context
Model sees all of it (if it can)	Model sees only what's retrieved
Lost-in-the-middle risk	Retrieval-quality risk

When to Use Which

Use LONG CONTEXT when the relevant information is bounded and fits the window, when you want simplicity (no retrieval system to build), or when the task needs the model to see ALL the information together (e.g. reasoning over a whole document). Use RAG when the knowledge base is huge or constantly changing, when cost matters (a small retrieved context is far cheaper than a giant one), or when you want citations. They are not mutually exclusive — many systems retrieve a focused set of documents AND use a long context to hold them.

✧

Context Note: The Shifting Frontier

As context windows grow from thousands to millions of tokens, the balance between long context and RAG shifts. With a 1M-token window, you can simply put a whole book or codebase in context, reducing the need for retrieval. But RAG never becomes obsolete: no window holds the entire internet or an enterprise's whole document store, and retrieval is far cheaper than processing a giant context. The likely future is hybrid — retrieve a relevant subset, then use a long context to reason over it richly.

This is a live, evolving question. The 'just make the context bigger' camp and the 'retrieve what you need' camp each have momentum, and the right answer keeps shifting with window sizes, costs, and retrieval quality. Understanding both — as you now do — is what lets you choose well as the frontier moves.

Let us consolidate the chapter into one map of how long context is achieved and what to keep in mind.

The Three Families of Solutions

Family	Approach	Examples
Extend the window	Make positions work longer	RoPE scaling, YaRN
Cheapen attention	Beat or sidestep O(T²)	FlashAttn, sparse, linear, Mamba
Retrieve into context	Don't hold it all	External memory, RAG

The Three Ideas to Remember

If you remember three things about long context: First, the QUADRATIC WALL is the core obstacle — attention costs O(T²), so naive long context is infeasible, and every technique is a way around it. Second, POSITION EXTENSION (RoPE scaling, YaRN) lets short-trained models reach long contexts cheaply, while EFFICIENT ATTENTION and SSMs (Mamba) attack the compute. Third, a BIG WINDOW IS NOT FREE CAPABILITY — models may lose information in the middle, reason poorly over long contexts, and cost a lot, so evaluate on your real task and consider RAG as a complement.

✧

Context Note: Long Context Embodies the Frontier's Pattern

Like MoE before it, long context shows the frontier's pattern: a fundamental limit (the quadratic wall) is overcome not by one breakthrough but by a stack of clever techniques — position tricks, efficient kernels, new architectures, external memory — each rethinking a piece of what seemed fixed. And like MoE, it remains unfinished: the right long-context architecture, the limits of SSMs, and how to evaluate and truly use long context are all open.

You now understand how models reach and use long contexts, and the trade-offs involved. The next chapter pushes the frontier further still — to agents that use these long-context, tool-using, reasoning models to autonomously pursue goals over many steps, individually and in teams.

Long-Context Quick-Reference

Concept	Key idea	Remember
Why long context	More working memory	Holds books, codebases, dialogs
Quadratic wall	Attention costs O(T²)	Doubling context 4x's the cost
Position encodings	Models break past trained length	RoPE enables extension
RoPE scaling / YaRN	Interpolate, don't extrapolate	Extend short models cheaply
Efficient attention	FlashAttn (exact), sparse (approx)	Cheaper attention compute
Linear attention	Reorder to O(T); recurrent view	RNN ideas reborn
Mamba / SSMs	Fixed-size selective state	O(T), but compresses the past
External memory	Retrieve into context (MemGPT)	Unbounded, RAG-like
Lost in the middle	Big window != uses it well	Evaluate on your real task

Exercises

Exercises 1–10 are pen-and-paper or derivations; 11–20 require code.

✎

Exercise 1: Pen & Paper

Explain why context length is the model's 'working memory'. Give three tasks that long context enables and why each needs it.

✎

Exercise 2: Pen & Paper

Explain why attention is O(T²). Compute the number of pairwise interactions at T=1K, 100K, and 1M, and explain why quadratic growth is so punishing.

✎

Exercise 3: Pen & Paper

Why does a model trained on 4K tokens break when fed 8K, even ignoring compute cost? Connect your answer to position encodings.

✎

Exercise 4: Pen & Paper

Explain RoPE at a high level. Why does its rotational structure make context extension possible where absolute position embeddings don't?

✎

Exercise 5: Pen & Paper

Explain position interpolation using the ruler analogy. Why does interpolation succeed where extrapolation fails?

✎

Exercise 6: Pen & Paper

Compare position interpolation, NTK-aware scaling, and YaRN. What does scaling different frequencies differently achieve?

✎

Exercise 7: Pen & Paper

Distinguish FlashAttention from sparse attention. Which changes the compute complexity and which only the memory? Explain.

✎

Exercise 8: Derive

Show how linear attention reorders softmax(QK^T)V to phi(Q)(phi(K)^T V) and why this changes the cost from O(T^2) to O(T).

✎

Exercise 9: Pen & Paper

Explain state-space models and Mamba's 'selectivity'. What is the core trade-off between attention and SSMs (random access vs compression)?

✎

Exercise 10: Pen & Paper

Explain external memory (MemGPT) using the OS RAM/disk analogy. How does it relate to RAG, and when would you use it over a bigger window?

✎

Exercise 11: Code

Measure attention's quadratic cost empirically: time a full-attention forward pass at increasing sequence lengths and plot the O(T^2) growth.

✎

Exercise 12: Code

Implement RoPE from scratch and verify that the attention score between two tokens depends only on their relative distance.

✎

Exercise 13: Code

Implement position interpolation: take a RoPE model and scale positions to extend its context. Test whether it can use positions beyond its trained length.

✎

Exercise 14: Code

Implement YaRN-style per-frequency RoPE scaling and compare its long-context perplexity to plain linear interpolation.

✎

Exercise 15: Code

Implement sliding-window (local) attention with a few global tokens. Compare its cost and quality to full attention on a long-sequence task.

✎

Exercise 16: Code

Implement linear attention (the reordered form) and verify it scales linearly with sequence length. Compare its quality to full attention on a toy task.

✎

Exercise 17: Code Lab

Implement a minimal selective state-space layer (Mamba-style): a fixed-size state with an input-dependent update. Process a long sequence in linear time and compare memory use to attention.

✎

Exercise 18: Code Lab

Build a needle-in-a-haystack test: hide a fact at various positions in a long context and measure retrieval accuracy by position. Demonstrate the lost-in-the-middle effect.

✎

Exercise 19: Code

Build a MemGPT-style external memory: store conversation history, retrieve relevant pieces into a small context window on demand, and answer a question referencing distant history.

✎

Exercise 20: Code (Challenge)

Take a short-context model and extend it to a long context end-to-end: apply YaRN position scaling, fine-tune briefly on long sequences, add a sliding-window+global sparse attention pattern, and manage the KV cache. Evaluate with a needle-in-a-haystack test across positions AND a multi-hop reasoning task over long context, and compare against a RAG baseline that retrieves a small relevant context. Report where long context wins, where RAG wins, and where lost-in-the-middle hurts.

Further reading: “RoFormer: Enhanced Transformer with Rotary Position Embedding” (Su et al., 2021, RoPE). “Extending Context Window via Position Interpolation” (Chen et al., 2023). “YaRN: Efficient Context Window Extension” (Peng et al., 2023). “LongRoPE” (Ding et al., 2024). “Longformer” (Beltagy et al., 2020) and “Big Bird” (Zaheer et al., 2020) for sparse attention. “Transformers are RNNs” (Katharopoulos et al., 2020) for linear attention. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” (Gu & Dao, 2023). “MemGPT” (Packer et al., 2023). “Lost in the Middle” (Liu et al., 2023).

Next → Chapter 34: Agents & Multi-Agent Systems

You now have models with vast capacity (Chapter 32) and long memory (this chapter), able to use tools (Chapter 28), retrieve knowledge (Chapter 29), and reason (Chapter 25). Chapter 34 brings these together into AGENTS — systems that pursue goals autonomously over many steps — and MULTI-AGENT systems, where several agents collaborate, debate, or divide labor to solve problems no single call could. We will explore planning, agent memory, the orchestration of multiple specialized agents, and the emergent behaviours and failure modes of agentic systems. The capabilities of the whole book converge into autonomous, goal-directed AI.

✎ 20 Exercises in this chapter

Attempt each exercise before checking the worked solutions.

View Solutions →

←

PreviousCh 32. Mixture of Experts

NextCh 34. Agents & Multi-Agent Systems

→