Open Problems & Future Directions
Detailed solutions for the exercises in Chapter 35. Try solving them yourself before checking the answers.
Solution
LLMs were discovered to work — reasoning, coding, conversing emerged from scaling next-token prediction — before any theory explained WHY, so the engineering has outrun the science. We can build and align them without deeply understanding what they have learned or how they compute. For deployment this counsels HUMILITY: because we cannot fully predict their failures or guarantee their behavior from first principles, we should rely on extensive empirical testing, monitor in production, keep humans in the loop for high-stakes uses, and avoid over-trusting systems whose inner workings remain opaque. Capability without comprehension argues for caution proportional to stakes.
Solution
Interpretability underpins safety, trust, debugging, and control because nearly every other problem is harder when we cannot see inside the model: we can't reliably detect deception, predict failures, verify reasoning, or confidently align a system we can't read. If we could fully see inside models — identify when they are being deceptive, understand why they hallucinate or fail, and verify their reasoning mechanistically — alignment would become checkable rather than hopeful, debugging would be principled, and trust could be earned by inspection. Interpretability is high-leverage precisely because progress there ripples through every downstream problem.
Solution
Scalable oversight is the problem of providing a reliable training signal and reliable evaluation for systems whose outputs humans can no longer judge. Every alignment method in Part V grounds out in human judgment somewhere; as models exceed our ability to evaluate them (proving theorems we can't follow, writing code too complex to review), that grounding weakens, because we cannot reward what we cannot assess. It must be solved BEFORE, not after, building superhuman systems: once a system exceeds our judgment we have no trustworthy way to check whether it is aligned, so we would be deploying something we cannot evaluate, with no recourse if it is subtly misaligned. The oversight mechanism has to exist before the capability does.
Solution
For pattern-matching: models fail on trivial variations of problems they otherwise ace, are sensitive to phrasing, and make errors a genuine reasoner wouldn't, all consistent with sophisticated interpolation over reasoning-shaped training text. For genuine inference: models generalize to novel problems, solve multi-step tasks requiring composition, and show internal structure suggesting more than surface mimicry. The honest view is that it is likely BOTH: models have learned real, reusable reasoning procedures that nonetheless remain brittle and incompletely general, making them more than a parrot but less than a reliable logician. Pinning down where on that spectrum a given model sits is itself an open empirical question.
Solution
There are two camps. 'Only managed': a generative model trained for plausible continuations can always produce plausible falsehoods, so the realistic goal is mitigation — grounding via RAG, calibration, abstention, verification. 'Can be reduced fundamentally': training and architectural changes might make models reliably respect the boundary of their knowledge. A well-calibrated model would have its expressed confidence MATCH its actual accuracy — it would say 'I'm not sure' or abstain when uncertain, and be reliably right when confident, so users could trust its certainty. Achieving that honest, calibrated uncertainty — a model that 'knows what it doesn't know' — would defuse much of hallucination's harm even if rare confabulation persists. My read: management plus better calibration is the near-term path; full elimination is uncertain.
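The idea of confidence matching accuracy can be made concrete with expected calibration error (ECE), a common summary of the gap between the two. The sketch below is a minimal illustration, not a reference implementation: the function name, the choice of ten equal-width bins, and the example numbers are all assumptions made for demonstration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was actually right.
    n_bins: number of bins (10 is an illustrative default, not a standard).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: both models claim 90% confidence on ten answers.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(f"calibrated: {calibrated:.3f}, overconfident: {overconfident:.3f}")
```

A model that is right nine times out of ten at 90% confidence scores near zero; one that is right only half the time at the same stated confidence scores about 0.4. A low ECE does not by itself mean the model is accurate, only that its stated certainty can be taken at face value.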
Solution
A trained model is frozen: updating it on new information tends to overwrite old knowledge (catastrophic forgetting), so it cannot accumulate experience the way humans do without expensive, risky retraining. We patch this with RAG (external knowledge), long context (working memory), and agent memory — but the model ITSELF doesn't learn after training. If continual learning were solved, models could improve from use, stay current without retraining, personalize to users over time, and accumulate skills — transforming them from static snapshots into systems that grow. It would also reshape deployment (no costly retraining cycles) and raise new safety questions (a model that keeps changing is harder to evaluate and control).
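The overwriting effect can be seen even in a deliberately tiny setting. The sketch below assumes a one-weight linear model trained by plain SGD on two invented tasks that disagree about the right weight; it is only an illustration of sequential training overwriting earlier learning, not a model of how forgetting arises in large networks with many shared parameters.

```python
def train(weight, data, learning_rate=0.1, epochs=50):
    """Plain SGD on squared error for the model y = weight * x."""
    for _ in range(epochs):
        for x, y in data:
            gradient = 2 * (weight * x - y) * x
            weight -= learning_rate * gradient
    return weight


def mean_squared_error(weight, data):
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)


inputs = [0.5, 1.0, 1.5]
task_a = [(x, 2.0 * x) for x in inputs]   # hypothetical task A: y = 2x
task_b = [(x, -1.0 * x) for x in inputs]  # hypothetical task B: y = -x

weight = train(0.0, task_a)
loss_a_before = mean_squared_error(weight, task_a)

# Continue training on task B only, with no replay of task A.
weight = train(weight, task_b)
loss_a_after = mean_squared_error(weight, task_a)

print(f"task A loss before B: {loss_a_before:.4f}, after B: {loss_a_after:.4f}")
```

After training on task B alone, the single shared weight has moved to fit B and the error on A climbs from near zero to a large value. Nothing in the update rule protects what was learned first, which is the core of the problem the patches above work around.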
Solution
'Stochastic parrots': they predict text statistically with no grounding, producing fluent mimicry without understanding. 'World models': to predict text well they must learn structure about the world — implicit physics, causality, other minds — because you cannot reliably predict descriptions of a world without modeling it; some interpretability work finds structured internal representations (of space, game state) suggesting more than surface mimicry. Evidence that would shift me toward 'world models': robust, probe-able internal representations that causally drive correct behavior on genuinely novel situations. Evidence toward 'parrots': systematic failures that track surface statistics and collapse whenever the test truly leaves the training distribution. The likely answer is a partial, imperfect world model — real structure, incompletely grounded — and the question is fascinating precisely because it is still open.
Solution
The data wall: high-quality human text is finite, and the largest models have consumed much of it, so 'just train on more data' is reaching its limit. The human-efficiency gap: a child learns language from orders of magnitude less text than an LLM needs, so current learning methods are deeply sample-INEFFICIENT. These point the same way: future progress may hinge not on more data/compute (the levers that drove recent gains, now running out) but on learning MORE FROM LESS — better algorithms, learning from richer signals (interaction, multimodality), or new paradigms. If the data wall binds, sample efficiency rather than brute scale would define the next era, rewarding deep understanding over sheer resources.
Solution
Evaluation is in crisis for four reasons: models saturate benchmarks faster than we build them; test data leaks into training (contamination); optimizing for a benchmark turns it into a target and erodes what it measures (Goodhart's law); and the hardest tasks, the ones we most want to measure, are exactly those humans struggle to judge. A workable approach for near-human systems is DYNAMIC, contamination-resistant evaluation: freshly generated or held-out tasks, adversarial probes, and measurement of real-world IMPACT (does it help users accomplish goals?) rather than static multiple-choice, combined with interpretability-based checks and, where possible, verifiable tasks with ground truth. The aim is evaluation that can't be memorized or gamed and that scales with capability, since reliable measurement underpins all other progress.
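One simple way to screen for contamination is to measure how many of a test item's word n-grams also appear in the training corpus. The sketch below is a minimal illustration under stated assumptions: whitespace tokenization, lowercase matching, and the n-gram length are all choices made here for demonstration, and the function names are hypothetical. Exact n-gram overlap also misses paraphrased or translated leaks.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_documents, n=8):
    """Fraction of the test item's n-grams that appear anywhere in the corpus.

    n=8 is an illustrative default; returns 0.0 if the item is shorter than n.
    """
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for document in corpus_documents:
        corpus_grams |= word_ngrams(document, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical corpus and test items, using n=3 so the example stays short.
corpus = ["the quick brown fox jumps over the lazy dog"]
leaked = overlap_fraction("quick brown fox jumps over", corpus, n=3)
fresh = overlap_fraction("a completely new question about graphs", corpus, n=3)
print(f"leaked item overlap: {leaked:.2f}, fresh item overlap: {fresh:.2f}")
```

An item with high overlap is a candidate for removal or for separate reporting. This kind of check only works when the training data is available to inspect, which is one reason freshly generated tasks are attractive.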
Solution
Bull case: the paradigm has repeatedly surprised skeptics, each apparent limit (reasoning, long context, multimodality) has fallen to more scale and clever engineering, and emergent capabilities suggest scaling keeps unlocking new abilities, so continued scaling plus efficiency gains may reach transformative capability. Bear case: signs of diminishing returns are appearing, the data wall looms, reasoning remains brittle, and the deepest problems (genuine understanding, continual learning) may need new ideas, not bigger Transformers. Which case is more convincing remains uncertain, and reasonable people disagree. A defensible position is that scaling will continue to deliver substantial gains for some time WHILE the hardest problems (robustness, understanding) increasingly require conceptual breakthroughs, so the paradigm goes far but perhaps not all the way alone. The honest answer is that no one knows, which is what makes the question so consequential.
Solution
A strong case for MOST IMPORTANT is scalable oversight / alignment: as capabilities grow, getting alignment right becomes higher-stakes, and it must be solved before systems exceed our judgment — the cost of failure is greatest. A strong case for MOST INTERESTING is interpretability: it is intellectually deep (reverse-engineering emergent computation), high-leverage (it would help with alignment, hallucination, and reasoning at once), and newly tractable (sparse autoencoders, circuit analysis are yielding real findings). Reasonable people will weight these differently; the key is that 'important' (consequences) and 'interesting' (tractability and depth) are distinct axes, and the best problems — like interpretability — often score high on both.
Solution
Example (world models): train a small Transformer on move sequences from a simple game (e.g. Othello) where the true board state is known but never given to the model, then train linear probes on its activations to test whether the board state is linearly decodable — and, crucially, INTERVENE on the probed representation and check whether the model's predictions change accordingly. If editing the internal 'board' causally changes behavior, that is evidence of a genuine internal world model, not mere correlation. This is runnable on modest compute and directly probes Exercise 7's question — the kind of careful small experiment that has advanced interpretability.
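The probe-then-intervene logic can be sketched without a Transformer at all. The toy below is not Othello and not a trained network: the "activations" are synthetic vectors that encode one hidden binary state along an assumed direction, the "model behavior" is a hand-written readout, and the probe is a difference-of-means linear probe rather than one trained by gradient descent. All names and constants are hypothetical. Because the readout is built to use the encoded direction, the causal result here is guaranteed by construction; in a real model, whether the intervention changes behavior is exactly the empirical question.

```python
import random

random.seed(0)

ENCODING = [1.0, -0.5, 0.25, 0.0]  # assumed direction that carries the hidden state


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_activation(state):
    """Synthetic hidden activation: state (+1 or -1) along ENCODING plus noise."""
    return [state * e + random.gauss(0.0, 0.1) for e in ENCODING]


def toy_readout(activation):
    """Stand-in for downstream model behavior: predicts +1 or -1."""
    return 1 if dot(activation, ENCODING) > 0 else -1


def fit_probe(activations, states):
    """Difference-of-means linear probe: returns (unit direction, midpoint)."""
    pos = [h for h, s in zip(activations, states) if s == 1]
    neg = [h for h, s in zip(activations, states) if s == -1]
    mean_pos = [sum(column) / len(pos) for column in zip(*pos)]
    mean_neg = [sum(column) / len(neg) for column in zip(*neg)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = dot(diff, diff) ** 0.5
    midpoint = [(p + n) / 2 for p, n in zip(mean_pos, mean_neg)]
    return [d / norm for d in diff], midpoint


def probe_predict(activation, direction, midpoint):
    centered = [x - m for x, m in zip(activation, midpoint)]
    return 1 if dot(centered, direction) > 0 else -1


def intervene(activation, direction, midpoint):
    """Reflect the activation across the probe boundary, flipping the decoded state."""
    centered = [x - m for x, m in zip(activation, midpoint)]
    shift = 2 * dot(centered, direction)
    return [x - shift * d for x, d in zip(activation, direction)]


train_states = [random.choice([1, -1]) for _ in range(200)]
train_acts = [toy_activation(s) for s in train_states]
test_states = [random.choice([1, -1]) for _ in range(200)]
test_acts = [toy_activation(s) for s in test_states]

direction, midpoint = fit_probe(train_acts, train_states)

# Step 1: is the hidden state linearly decodable on held-out activations?
probe_accuracy = sum(
    probe_predict(h, direction, midpoint) == s for h, s in zip(test_acts, test_states)
) / len(test_states)

# Step 2: does editing along the probe direction change downstream behavior?
flip_rate = sum(
    toy_readout(intervene(h, direction, midpoint)) != toy_readout(h) for h in test_acts
) / len(test_acts)

print(f"held-out probe accuracy: {probe_accuracy:.2f}, behavior flipped: {flip_rate:.2f}")
```

The two numbers correspond to the two claims in the experiment: high probe accuracy shows the state is decodable, and a high flip rate shows the decoded direction is actually used. In a real study a useful control is to apply an equally large edit along a random direction and confirm that behavior changes far less.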
Solution
Frontier models are expensive to train and run, concentrating them among a few well-resourced organizations — so efficiency (smaller, cheaper models that retain capability) directly shapes who can build and control AI. If only a handful of actors can afford frontier systems, the future is shaped by their choices; if capable models become cheap and widely runnable, control disperses. Broad access would require continued efficiency advances (distillation, quantization, better architectures), open models and tools, affordable compute, and shared datasets — plus thoughtful governance to balance openness against misuse risk. The efficiency research in this book is thus not just about cost; it is about the distribution of power over a transformative technology.
Solution
Take ATTENTION. It first appears in Chapter 9 as a fix for the seq2seq bottleneck (Bahdanau/Luong), letting a decoder look back at all encoder states. Chapter 12 generalizes it into self-attention — every token attending to every other — and Chapter 13 makes it the core of the Transformer. Chapters 19–20 confront its costs (KV cache, GQA, FlashAttention), Chapter 27 optimizes it for inference, and Chapter 33 attacks its quadratic wall with sparse/linear attention and SSM alternatives. Across the book attention evolves from a helpful add-on, to the central architectural primitive, to a scaling bottleneck that the frontier works to transcend — a single idea whose trajectory mirrors the field's: a breakthrough that becomes foundational and then becomes the next thing to overcome.
Solution
Most trustworthy: capabilities with VERIFIABLE outputs, e.g. code that can be run and tested, or arithmetic checkable against ground truth, because correctness can be confirmed, closing the loop (Chapters 25, 34). Least trustworthy: confident factual claims and reasoning in domains WITHOUT verification, because hallucination (Section 35.5) and brittle reasoning (Section 35.4) mean fluent, confident output can be wrong with no warning signal. The calibration principle: trust scales with VERIFIABILITY. For high-stakes use, prefer tasks where the output can be checked, ground answers in retrieved sources with citations (RAG), keep a human in the loop, and treat unverifiable confident assertions with the most skepticism.
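For code, "closing the loop" can be as simple as refusing to accept generated output until it passes known test cases. The sketch below is a minimal illustration with a hypothetical helper name and invented candidates. It uses Python's built-in `exec`, which runs arbitrary code in the current process, so a real system would need to run untrusted model output in a sandbox instead.

```python
def passes_tests(candidate_source, function_name, test_cases):
    """Return True only if the candidate defines function_name and every case passes.

    test_cases: list of (args_tuple, expected_result) pairs.
    WARNING: exec runs the candidate unsandboxed; this is for illustration only.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        function = namespace[function_name]
        return all(function(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failure.
        return False


# Hypothetical model outputs for the request "write add(a, b)".
good_candidate = "def add(a, b):\n    return a + b\n"
bad_candidate = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]

print("good candidate accepted:", passes_tests(good_candidate, "add", cases))
print("bad candidate accepted:", passes_tests(bad_candidate, "add", cases))
```

The check is only as strong as the test cases: a candidate that passes them can still be wrong on inputs they do not cover. That is why verifiability raises trust without making it unconditional.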
Solution
Alignment: full interpretability would let us VERIFY a model's goals and detect deception or misalignment directly, turning alignment from a behavioral hope into a mechanistic check, and easing scalable oversight (we could inspect reasoning we can't otherwise judge). Hallucination: we could see when a model is confabulating versus recalling, enabling reliable detection and abstention. Reasoning: we could verify whether stated reasoning reflects actual computation (CoT faithfulness) and locate where reasoning fails. It is high-leverage because these otherwise separate, hard problems all share a root cause, the opacity of the model, so progress on interpretability advances all of them at once, which is why it attracts such intense effort.
Solution
A plausible (necessarily uncertain) picture: likely substantial PROGRESS on long context, multimodal integration, agent reliability in verifiable domains, and efficiency (smaller capable models). Likely still OPEN: deep interpretability of frontier models, robust scalable oversight, reliable reasoning in unverifiable domains, true continual learning, and the paradigm question. NEW problems that may emerge: governing increasingly autonomous agents, securing agentic systems against novel attacks, managing model-generated-data ecosystems (collapse), and evaluating systems that exceed human judgment. The constant is that solving today's problems tends to surface harder ones — the frontier recedes as we approach it, which is part of what makes the field perpetually open.
Solution
This is yours to answer — the book's final invitation. The understanding you have built guides you in three ways: it lets you BUILD with insight (knowing what each component does, its costs, and its failure modes, so you choose architectures and techniques deliberately rather than by fashion); it lets you STUDY the open problems productively (you know the foundations well enough to see their real limits and design experiments that matter); and it lets you QUESTION claims critically (distinguishing what is solved from what is hyped, what is measured from what is asserted). Whether you go into research, engineering, safety, or application, you carry a complete, honest mental model of how these systems work and where they fall short — which is exactly what it takes to push the frontier forward. The rest is up to you.