Agents & Multi-Agent Systems
We have built models that can reason (Chapter 25), use tools (Chapter 28), retrieve knowledge (Chapter 29), and remember long contexts (Chapter 33). This chapter brings them together into AGENTS — systems that pursue GOALS autonomously over many steps, deciding for themselves what to do next. An agent is where all the book's capabilities converge into something that can act in the world with a degree of independence.
Agent vs Single Model Call
A normal model call is a single round-trip: you ask, it answers, done. An AGENT is fundamentally different: given a GOAL, it operates in a LOOP — deciding what to do, taking an action (often a tool call), observing the result, deciding the next action, and continuing until the goal is achieved. The agent is not just responding; it is autonomously working toward an objective across many steps, charting its own course.
| Single model call | Agent |
|---|---|
| One question, one answer | A goal pursued over many steps |
| You drive each step | The agent decides the next step |
| Stateless round-trip | Maintains state across a loop |
| No tools, or one call | Orchestrates many tools |
| You: 'What is X?' | You: 'Accomplish goal G' |
| Reactive | Autonomous, goal-directed |
The Ingredients of an Agent
An agent combines several capabilities from across this book into a coherent goal-pursuing system. The model is the 'brain' that decides; tools let it act; memory lets it track progress; planning lets it sequence steps; reflection lets it correct mistakes. The agent loop ties them together. Each ingredient was covered earlier; the agent is their integration.
Arch Stack: The ingredients of an agent
| Ingredient | Role |
|---|---|
| Goal | what the agent is trying to achieve |
| Model (the brain) | reasons and decides (Ch. 25) |
| Planning | decomposes the goal into steps (§34.3) |
| Tools | let the agent act in the world (Ch. 28) |
| Memory | tracks progress and recalls the past (§34.5) |
| Reflection | checks and corrects its own work (§34.4) |
At the heart of every agent is a LOOP. The agent perceives its situation, decides what to do, acts, observes the result, and repeats until the goal is met. This generalizes the tool-calling loop and the ReAct framework, both from Chapter 28, into the core engine of autonomous behaviour. Understanding the loop is understanding agents.
Tool Trace: The agent loop in action: 'Book a table for 4 tomorrow at 7pm'
| Actor | Step | Flow |
|---|---|---|
| Agent | Plan: find restaurants, check availability, book one | • |
| Agent | Action: search('restaurants near me, 4 people') | → |
| Tool | Returns a list of nearby restaurants | ← |
| Agent | Observe + decide: check the top option's availability | • |
| Agent | Action: check_availability(restaurant, tomorrow 7pm, 4) | → |
| Tool | Returns: available | ← |
| Agent | Action: book(restaurant, tomorrow 7pm, 4) | → |
| Agent | Goal achieved → report success to the user | ← |
# Given a goal, loop until done (with a step cap).
# model, memory, initial_observation, execute, and update are placeholders for your own components.
def run_agent(goal, model, memory, max_steps=20):
    state = initial_observation(goal)
    for step in range(max_steps):
        thought = model.reason(goal, state, memory)   # decide what to do
        if thought.is_done:
            return thought.answer
        action = thought.action                       # often a tool call
        observation = execute(action)                 # act in the world
        state = update(state, action, observation)    # observe
        memory.record(action, observation)            # remember
    return None                                       # step cap reached without finishing
# The loop is the engine; planning, reflection, memory enrich it

Generalizing ReAct
The agent loop is essentially the ReAct pattern (Chapter 28) — Thought, Action, Observation — scaled up with planning, memory, and the autonomy to pursue a multi-step goal. ReAct gave us the basic interleaving of reasoning and acting; the agent loop adds the structure to sustain it over long, goal-directed trajectories. Everything in the rest of this chapter — planning, reflection, memory, multi-agent — enriches this fundamental loop.
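To make the loop concrete, here is a minimal runnable sketch of the restaurant-booking trace. Everything in it is an assumption for illustration: the three tools are stubs, and `scripted_policy` is a hand-written stand-in for the model's reasoning step, not a real model call.

```python
from dataclasses import dataclass

@dataclass
class Thought:
    is_done: bool = False
    answer: str = ""
    action: tuple = ()          # (tool_name, argument)

# Hypothetical tools standing in for real search and booking APIs.
TOOLS = {
    "search": lambda query: ["Luigi's", "Sakura"],
    "check_availability": lambda name: name == "Luigi's",
    "book": lambda name: f"Booked {name}",
}

def scripted_policy(goal, history):
    """Stand-in for the model: choose the next action from what was observed."""
    if not history:
        return Thought(action=("search", goal))
    last_action, last_obs = history[-1]
    tool = last_action[0]
    if tool == "search":
        return Thought(action=("check_availability", last_obs[0]))
    if tool == "check_availability" and last_obs:
        return Thought(action=("book", last_action[1]))
    if tool == "book":
        return Thought(is_done=True, answer=last_obs)
    return Thought(is_done=True, answer="Could not complete the goal")

def run_agent(goal, policy, max_steps=10):
    history = []                                       # memory of (action, observation)
    for _ in range(max_steps):
        thought = policy(goal, history)                # decide
        if thought.is_done:
            return thought.answer, history
        name, arg = thought.action
        observation = TOOLS[name](arg)                 # act
        history.append((thought.action, observation))  # observe and remember
    return "Stopped: step cap reached", history

answer, trace = run_agent("table for 4 tomorrow at 7pm", scripted_policy)
```

The step cap matters even in a toy: with `max_steps=2` the same run stops before booking instead of looping indefinitely.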
A complex goal cannot be achieved in one action — it must be broken into a sequence of smaller steps. PLANNING is how an agent decomposes a goal into a structured set of subtasks and decides the order to tackle them. Good planning is what lets agents handle complex, multi-step objectives rather than just simple ones.
Why Planning Helps
Without planning, an agent reacts step by step with no overall strategy — it may pursue a dead end, miss a necessary prerequisite, or wander. Planning first — 'to book a trip I need to: choose dates, find flights, find a hotel, book both' — gives the agent a roadmap. It can then execute the plan, adapting as it goes. Planning is the difference between purposeful progress and aimless flailing.
| Planning approach | How it works |
|---|---|
| Plan-then-execute | Make a full plan upfront, then carry it out step by step |
| Interleaved (ReAct) | Plan and act together; replan as you observe results |
| Decomposition | Break the goal into subtasks, possibly recursively |
| Tree-of-thoughts | Explore multiple plan branches, pick the best (Ch. 25 search) |
| Plan + reflect | Make a plan, critique it, revise before executing |
Plan-Then-Execute vs Interleaved
Two broad styles. PLAN-THEN-EXECUTE makes a complete plan first, then runs it — good for predictable tasks, but brittle if reality diverges from the plan. INTERLEAVED planning (the ReAct style) plans a step, acts, observes, and replans based on what happened — more adaptive, handling surprises, but potentially less coherent over long horizons. Many agents combine them: a high-level plan that is refined and adapted as execution reveals new information.
Tool Trace: Planning then executing a complex goal
| Actor | Step | Flow |
|---|---|---|
| Agent | Goal: 'Write a report on our Q3 sales' | • |
| Agent | Plan: (1) get sales data (2) analyze trends (3) draft (4) review | • |
| Agent | Step 1: query the sales database | → |
| Tool | Returns Q3 sales figures | ← |
| Agent | Step 2: analyze — reasons over the figures | • |
| Agent | Step 3: draft the report; Step 4: review and revise | • |
| Agent | Plan complete → deliver the report | ← |
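The combined style, a plan made upfront and then adapted during execution, might be sketched as follows. The planner, executor, and replanner are hypothetical stubs (a real agent would call the model for each), and the `db_down` flag exists only to force a replan.

```python
def make_plan(goal):
    # Hypothetical planner: a real agent would ask the model for this list.
    return ["get sales data", "analyze trends", "draft", "review"]

def execute_step(step, context):
    # Hypothetical executor: the primary data source can be unavailable.
    if step == "get sales data" and context.get("db_down"):
        return False, None
    return True, f"done: {step}"

def replan(failed_step, remaining):
    # Hypothetical replanner: replace the failed step with a fallback.
    return [f"{failed_step} from backup export"] + remaining

def plan_and_execute(goal, context, max_replans=2):
    plan = make_plan(goal)                  # plan first
    results, replans = [], 0
    while plan:
        step, rest = plan[0], plan[1:]
        ok, result = execute_step(step, context)
        if ok:
            results.append(result)          # step succeeded, move on
            plan = rest
        elif replans < max_replans:
            replans += 1                    # reality diverged, so adapt the plan
            plan = replan(step, rest)
        else:
            raise RuntimeError(f"gave up on step: {step}")
    return results

smooth = plan_and_execute("Q3 sales report", {})
adapted = plan_and_execute("Q3 sales report", {"db_down": True})
```

Bounding the number of replans keeps the adaptive path from turning into an unbounded loop.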
Agents make mistakes — a wrong tool call, a flawed step, a misread result. REFLECTION is the ability of an agent to EVALUATE its own work, recognize problems, and correct course. It is one of the most powerful techniques for making agents reliable, and it builds on the 'models judge better than they generate' asymmetry we saw in RLHF (Chapter 23) and Constitutional AI (Chapter 26).
The Reflect-and-Revise Pattern
Reflection adds a CRITIQUE step to the agent loop. After producing a result (or completing a step), the agent steps back and asks: 'Is this correct? Does it achieve the goal? What's wrong with it?'. Based on this self-critique, it revises and retries. This catches errors the agent would otherwise propagate, and often dramatically improves output quality — the same way a writer improves a draft by re-reading it critically.
Tool Trace: Reflection: the agent critiques and fixes its own work
| Actor | Step | Flow |
|---|---|---|
| Agent | Writes code to solve the task | • |
| Tool | Runs the code → test fails with an error | ← |
| Agent | Reflect: 'The error says index out of range — I had an off-by-one' | • |
| Agent | Revise: fixes the bug and resubmits | → |
| Tool | Runs again → tests pass | ← |
| Agent | Reflect: 'Tests pass, goal achieved' → done | ← |
Reflection Frameworks
Several frameworks formalize reflection. Reflexion (Shinn et al., 2023) has agents verbally reflect on failures and store those reflections to do better on retries. Self-refine has the model iteratively critique and improve its own output. Critic/actor patterns separate a 'doer' agent from a 'critic' agent that reviews its work. All exploit the same insight: a model is often better at SPOTTING a problem in a result than at AVOIDING it in the first place.
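A reflect-and-revise loop grounded in test feedback could look roughly like this. It is a sketch under stated assumptions: `generate` stands in for the model and simply returns a buggy draft first and a fixed draft once it has seen an error, and the stored `feedback_log` loosely mirrors the idea of keeping reflections for the retry.

```python
def run_tests(func):
    """Grounded feedback: returns None on success, else an error message."""
    try:
        assert func([3, 1, 2]) == 3
        assert func([5]) == 5
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def draft_v1(xs):
    return sorted(xs)[len(xs)]        # off-by-one: index out of range

def draft_v2(xs):
    return sorted(xs)[len(xs) - 1]    # corrected index

def generate(feedback_log):
    # Stand-in for the model: revise only after seeing a critique.
    return draft_v1 if not feedback_log else draft_v2

def reflect_and_revise(max_attempts=3):
    feedback_log = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback_log)
        error = run_tests(candidate)          # evaluate the work
        if error is None:
            return candidate, attempt, feedback_log
        feedback_log.append(error)            # keep the critique for the retry
    return None, max_attempts, feedback_log

solution, attempts, log = reflect_and_revise()
```

The critique here comes from running the code rather than from the model's opinion of it, which is the strongest form of reflection when a test is available.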
An agent working over many steps needs MEMORY — to track what it has done, remember key facts, and recall relevant past experience. Memory is what lets an agent maintain coherence over a long task and learn across tasks. It comes in several kinds, mirroring human memory, and connects directly to the long-context and external-memory ideas of Chapter 33.
Types of Agent Memory
| Memory type | Holds | Implemented as |
|---|---|---|
| Working / short-term | Current task state, recent steps | The context window |
| Long-term | Facts learned across tasks/sessions | External store + retrieval |
| Episodic | Records of past experiences/tasks | Stored trajectories |
| Semantic | General knowledge, learned facts | Knowledge base / RAG |
| Procedural | How to do things, learned skills | Stored routines / examples |
Short-Term vs Long-Term
SHORT-TERM (working) memory is the agent's current context: the recent steps and the active task state, held in the context window (Chapter 33). It is fast but limited, and it is lost when the context fills or the session ends. LONG-TERM memory persists beyond the limited window and across sessions: important information is written to an external store and retrieved when relevant (the MemGPT/RAG idea from Chapters 29 and 33). Together they let an agent both focus on the now and draw on accumulated experience.
Tool Trace: An agent using long-term memory
| Actor | Step | Flow |
|---|---|---|
| Agent | Starts a task; checks memory for relevant past experience | → |
| Memory | Retrieves: 'last time, the API needed auth in a header' | ← |
| Agent | Applies the recalled lesson, avoiding the past mistake | • |
| Agent | Completes the task; writes new lessons to memory | → |
| Memory | Stores the experience for future tasks | • |
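The write-then-recall cycle in this trace can be sketched with a very small store. The `LessonMemory` class and its word-overlap scoring are assumptions chosen to keep the example dependency-free; a real long-term memory would more likely use embeddings and a vector index.

```python
import re

class LessonMemory:
    """Toy long-term memory: stores (task, lesson) pairs, recalls by word overlap."""

    def __init__(self):
        self.lessons = []

    @staticmethod
    def _tokens(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def write(self, task, lesson):
        self.lessons.append((task, lesson))

    def recall(self, task, k=1):
        query = self._tokens(task)
        scored = []
        for past_task, lesson in self.lessons:
            overlap = len(query & self._tokens(past_task))
            if overlap:                       # skip memories with nothing in common
                scored.append((overlap, lesson))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]

memory = LessonMemory()
memory.write("call the billing API", "the API needed auth in a header")
memory.write("parse the CSV export", "dates are in day-first format")
recalled = memory.recall("fetch invoices from the billing API")
```

The recalled lesson would be placed into the context window at the start of the new task, which is where long-term and short-term memory meet.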
A capable agent has many tools (Chapter 28) — search, code execution, databases, APIs, file access — and must ORCHESTRATE them: choosing the right tool for each step, sequencing them, passing results between them, and combining their outputs. Tool orchestration is where the tool-calling of Chapter 28 scales up into coordinated, multi-tool problem-solving.
The Orchestration Challenges
With many tools, the agent faces new challenges beyond single tool calls. It must SELECT the right tool from many (harder with more options). It must SEQUENCE tools correctly (some depend on others' outputs — the sequential calls of Chapter 28). It must PASS DATA between tools (the output of one becomes the input to another). And it must HANDLE failures gracefully so one tool's error doesn't derail the whole task. Good orchestration makes a many-tool agent feel coherent rather than chaotic.
Tool Trace: Orchestrating multiple tools for a research task
| Actor | Step | Flow |
|---|---|---|
| Agent | Goal: 'Summarize recent news about company X' | • |
| Agent | Tool 1: web_search('company X recent news') | → |
| Tool | Returns article URLs | ← |
| Agent | Tool 2: fetch_page(top URLs) — uses search's output | → |
| Tool | Returns article contents | ← |
| Agent | Tool 3: summarize — reasons over fetched content | • |
| Agent | Combines into a coherent summary → deliver | ← |
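The same research flow, with data passed between tools and one failure handled gracefully, might look like the sketch below. The three tool functions are stubs with made-up behaviour (one URL always times out), so only the orchestration logic should be read as the point.

```python
def web_search(query):
    # Hypothetical tool: returns article URLs.
    return ["https://example.com/a", "https://example.com/b"]

def fetch_page(url):
    # Hypothetical tool: one URL is unreachable, to exercise the failure path.
    if url.endswith("/b"):
        raise ConnectionError("timeout")
    return f"contents of {url}"

def summarize(pages):
    # Stand-in for a model call that reasons over the fetched text.
    return f"summary of {len(pages)} article(s)"

def research(query):
    urls = web_search(query)                  # tool 1
    pages, failures = [], []
    for url in urls:                          # tool 2 consumes tool 1's output
        try:
            pages.append(fetch_page(url))
        except ConnectionError as exc:
            failures.append((url, str(exc)))  # record the failure and keep going
    if not pages:
        return "no sources could be fetched", failures
    return summarize(pages), failures         # tool 3 consumes tool 2's output

summary, failures = research("company X recent news")
```

Returning the failures alongside the result lets the agent, or a human reviewer, see that the summary rests on fewer sources than were requested.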
Managing Many Tools
As the number of tools grows, selection becomes harder — the model must pick correctly from dozens or hundreds. Techniques help: grouping related tools, retrieving only the RELEVANT tools for the current task (tool RAG — embed tool descriptions, retrieve the ones matching the step), and hierarchical organization (high-level tools that internally use lower-level ones). The same description-quality lesson from Chapter 28 applies: clear tool descriptions are what let the agent orchestrate many tools well.
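Retrieving only the relevant tools for a step can be illustrated with a bag-of-words stand-in for embeddings. The tool names and descriptions below are invented for the example, and a production system would presumably use a learned embedding model rather than word counts.

```python
import math
import re
from collections import Counter

# Hypothetical tool registry: name -> description.
TOOL_DESCRIPTIONS = {
    "web_search": "search the web for pages and news articles",
    "run_sql": "query the sales database with sql",
    "send_email": "send an email message to a recipient",
    "fetch_page": "download the contents of a web page from a url",
}

def embed(text):
    # Toy "embedding": word counts. Assumption: replace with a real embedder.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_tools(step, k=2):
    """Return the k tools whose descriptions best match the current step."""
    query = embed(step)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(query, embed(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:k]

sql_tools = retrieve_tools("query the sales database for Q3 revenue", k=1)
news_tools = retrieve_tools("search for recent news articles", k=1)
```

Only the retrieved tools are then shown to the model for that step, which keeps the choice small even when the registry holds hundreds of tools. The sketch also shows why description quality matters: the match is made entirely on the description text.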
So far, one agent. But some problems are better solved by MULTIPLE agents working together — each specialized, collaborating, debating, or dividing labor. Multi-agent systems are an active frontier, promising to tackle complex problems beyond any single agent. Let us understand when and how multiple agents help.
Why Use Multiple Agents?
Several motivations. SPECIALIZATION: different agents can be experts at different things (a coder agent, a reviewer agent, a researcher agent), each with tailored tools and prompts. SEPARATION OF CONCERNS: breaking a complex task across focused agents can be more reliable than one agent juggling everything. DIVERSE PERSPECTIVES: multiple agents can debate or critique each other, surfacing errors a single agent would miss. PARALLELISM: independent subtasks can be handled by different agents simultaneously.
| Benefit | How multiple agents help |
|---|---|
| Specialization | Each agent expert at one role, with tailored tools/prompts |
| Separation of concerns | Focused agents are more reliable than one doing everything |
| Diverse perspectives | Agents debate/critique, catching errors (like reflection) |
| Parallelism | Independent subtasks handled simultaneously |
| Modularity | Easier to build, test, and improve focused agents |
Multiple agents must be ORGANIZED — who decides what, who talks to whom, how work flows. Several coordination patterns have emerged, each suited to different problems. Knowing them helps you design multi-agent systems and understand frameworks like AutoGen and LangGraph.
| Pattern | How it works |
|---|---|
| Orchestrator-worker | A manager agent plans and delegates subtasks to worker agents |
| Pipeline / sequential | Agents in a chain; each does its stage, passes to the next |
| Debate | Agents argue different positions; converge or a judge decides |
| Hierarchical | Managers of managers — nested delegation |
| Blackboard / shared state | Agents read/write a shared workspace, coordinating loosely |
| Group chat | Agents converse in a shared thread, taking turns (AutoGen) |
Orchestrator-Worker: The Most Common Pattern
The dominant multi-agent pattern is ORCHESTRATOR-WORKER (also called manager-worker or supervisor). One ORCHESTRATOR agent receives the goal, breaks it into subtasks, and delegates each to a specialized WORKER agent. The workers execute and report back; the orchestrator synthesizes their results and decides next steps. It mirrors a manager delegating to a team — clear responsibility, easy to reason about, and flexible.
Tool Trace: Orchestrator-worker coordination
| Actor | Step | Flow |
|---|---|---|
| Orchestrator | Goal: 'Build a feature' → plans and delegates | • |
| Worker | Coder agent: writes the implementation | → |
| Worker | Tester agent: writes and runs tests | → |
| Worker | Reviewer agent: reviews the code for quality | → |
| Orchestrator | Synthesizes results; requests fixes if needed | ← |
| Orchestrator | Feature complete → report to user | ← |
Debate and Critique
In DEBATE patterns, multiple agents argue different positions or independently solve a problem, then compare and critique each other's answers — often with a judge agent deciding, or the agents converging through discussion. This can improve accuracy and surface errors (different agents catch different mistakes), echoing self-consistency (Chapter 25) and reflection. The diversity of independent attempts is the value: agents that reason differently are unlikely to all make the same mistake.
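One simple form of debate, independent answers followed by a round of revision and a majority vote in place of a judge, can be sketched as below. The `steadfast` and `persuadable` agents are invented stand-ins for model-backed agents with different reasoning.

```python
from collections import Counter

def steadfast(answer):
    # Hypothetical agent that always gives the same answer.
    return lambda question, peers: answer

def persuadable(initial):
    # Hypothetical agent that adopts the majority view once it sees peers' answers.
    def agent(question, peers):
        if not peers:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent

def debate(agents, question, rounds=2):
    # Round 1: every agent answers independently.
    answers = [agent(question, []) for agent in agents]
    # Later rounds: each agent sees all previous answers and may revise.
    for _ in range(rounds - 1):
        answers = [agent(question, answers) for agent in agents]
    winner, votes = Counter(answers).most_common(1)[0]   # majority vote as the judge
    return winner, votes, answers

panel = [steadfast("42"), steadfast("42"), persuadable("41")]
winner, votes, final_answers = debate(panel, "What is 6 * 7?")
```

Each agent call here would be a model call in a real system, so this run makes `len(panel) * rounds` calls for a single answer.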
Multi-agent systems are promising but come with real challenges, and a sober view is essential. More agents introduce more coordination overhead, more places to fail, and more cost. Sometimes a single well-designed agent beats a complex multi-agent system.
| Challenge | What goes wrong |
|---|---|
| Coordination overhead | Agents spend effort communicating, not solving |
| Error propagation | One agent's mistake cascades through the others |
| Cost multiplication | Many agents = many model calls = high cost |
| Misalignment | Agents pursue subtly different interpretations of the goal |
| Conversation loops | Agents get stuck talking in circles without progress |
| Latency | Sequential agent steps add up to slow end-to-end time |
| Debugging difficulty | Hard to trace why a multi-agent system failed |
Error Propagation and Cost
Two challenges stand out. ERROR PROPAGATION: in a chain or team of agents, one agent's mistake feeds into the next, compounding (the same multiplication problem as multi-tool orchestration, now across agents). COST: each agent is one or more model calls, so a multi-agent system can make many times more calls than a single agent — multiplying both cost and latency. A debate among 5 agents over 3 rounds is 15+ model calls for one answer.
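The call-count arithmetic is easy to check in code. This sketch assumes one model call per agent per round, an optional extra call for a judge, and that agents within a round can run in parallel. The function names are illustrative only.

```python
def debate_calls(n_agents, n_rounds, judge=False):
    """Model calls for a debate: one per agent per round, plus an optional judge."""
    return n_agents * n_rounds + (1 if judge else 0)

def debate_latency(n_agents, n_rounds, seconds_per_call, parallel=True):
    """Wall-clock estimate: rounds are sequential, agents within a round may be parallel."""
    calls_in_sequence = n_rounds if parallel else n_agents * n_rounds
    return calls_in_sequence * seconds_per_call

calls = debate_calls(5, 3)                       # the example from the text
calls_with_judge = debate_calls(5, 3, judge=True)
slow = debate_latency(5, 3, seconds_per_call=2.0, parallel=False)
fast = debate_latency(5, 3, seconds_per_call=2.0, parallel=True)
```

Parallelism reduces latency but not cost: the number of calls, and therefore the bill, is the same either way.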
The recurring theme of agents is the gap between an impressive DEMO and a reliable SYSTEM. Agents that work in a demo often fail on the messy long tail of real tasks. Building agents that work CONSISTENTLY requires deliberate engineering for reliability, drawing together the lessons of this chapter and Chapter 28.
Reliability Techniques
The Compounding-Reliability Problem
The core reliability challenge is COMPOUNDING. An agent that takes many steps, each with some failure probability, has a success rate that is the PRODUCT of the per-step rates — which decays fast. Twenty steps at 95% each gives ~36% end-to-end success. This is why per-step reliability matters so much, and why reflection, verification, and error recovery (which catch and fix per-step failures before they compound) are essential. The longer the agent's trajectory, the more this matters.
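The product rule, and the effect of per-step recovery, can be checked numerically. The retry model below is a simplifying assumption: it treats every failure as detected and every retry as an independent attempt, which real agents only approximate.

```python
def end_to_end(p_step, n_steps):
    """Success rate of a trajectory when every step must succeed."""
    return p_step ** n_steps

def with_retries(p_step, n_steps, attempts):
    """Same, but each step may be attempted several times.

    Assumes failures are always detected and retries are independent.
    """
    p_effective = 1 - (1 - p_step) ** attempts
    return p_effective ** n_steps

baseline = end_to_end(0.95, 20)          # about 0.36
recovered = with_retries(0.95, 20, 2)    # about 0.95
```

Under these assumptions, one verified retry per step lifts a 20-step task from roughly 36% to roughly 95% end-to-end success, which is why catching failures at the step where they occur pays off so heavily.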
End-to-end success ≈ (per-step success rate) ^ (number of steps)
10 steps at 99% each → 0.99^10 ≈ 90%
20 steps at 95% each → 0.95^20 ≈ 36%
50 steps at 90% each → 0.90^50 ≈ 0.5%
# Errors COMPOUND. Per-step reliability + recovery is everything.

Agents are one of the most hyped and most rapidly evolving areas of AI. A grounded view of where they actually work today, and where they struggle, helps separate the promise from the reality.
Where Agents Work Well Today
| Domain | Why agents work there |
|---|---|
| Coding agents | Work is VERIFIABLE — run tests, see errors, fix and retry |
| Research / browsing | Decompose into searches; synthesize findings |
| Data analysis | Generate and run code, inspect results, iterate |
| Customer support | Bounded tasks with clear tools (lookup, ticket, refund) |
| Workflow automation | Well-defined multi-step processes with clear tools |
Notice the pattern in where agents succeed: VERIFIABILITY and BOUNDED SCOPE. Coding agents thrive because they can run tests and see whether the code works — grounded feedback closes the reflection loop. Bounded tasks with clear tools (support, automation) succeed because the agent's choices are constrained. Agents struggle most with open-ended, long-horizon, hard-to-verify tasks where mistakes compound and there's no clear feedback signal.
The Honest State of Agents
As of this writing, agents are powerful but unreliable for complex open-ended tasks. They shine in verifiable, bounded domains (especially coding) and increasingly handle real workflows, but long-horizon autonomous operation remains fragile: they get stuck, make compounding errors, and need human oversight. Capabilities are improving rapidly (better reasoning models, better tools, better frameworks), but the gap between agent demos and reliable agent products is still real. Deploy agents where the task is verifiable and bounded; be cautious where it is open-ended and high-stakes.
Let us consolidate the chapter into one picture of how an agent — single or multi — is built from the capabilities of this book.
Pipeline Flow: Building a capable agent
| Step | Stage | What it involves |
|---|---|---|
| 1 | Goal + loop | An agent loop pursuing a goal, with step/cost bounds |
| 2 | Plan | Decompose the goal into steps; replan as needed |
| 3 | Orchestrate tools | Select, sequence, and chain tools (Ch. 28) |
| 4 | Remember | Short-term context + long-term retrieval (Ch. 33) |
| 5 | Reflect + verify | Check work, ground in real feedback, correct errors |
| 6 | (Maybe) multi-agent | Orchestrator + specialized workers, if it helps |
| 7 | Reliability + oversight | Bounds, validation, human-in-the-loop, tracing |
The Three Ideas to Remember
If you remember three things about agents: First, an AGENT IS A LOOP, not a model — a system that pursues a goal over many steps by planning, acting, observing, and remembering, integrating everything in this book. Second, RELIABILITY IS THE CHALLENGE — errors compound across steps, so reflection, verification, and grounded feedback (run the test, fix the error) are what make agents work. Third, SIMPLER IS OFTEN BETTER — single agents often beat multi-agent systems, and agents shine in verifiable, bounded domains; add complexity only when it earns its place.
Agents Quick-Reference
| Concept | Key idea | Remember |
|---|---|---|
| Agent | A loop pursuing a goal | A system, not a model |
| Agent loop | Plan, act, observe, repeat | Generalizes ReAct; must be bounded |
| Planning | Decompose goal into steps | Weak in current models; replan |
| Reflection | Critique and fix own work | Exploits generate-evaluate gap |
| Memory | Short-term + long-term | Long-term = RAG on experience |
| Tool orchestration | Select, sequence, chain tools | Errors compound; validate |
| Multi-agent | Specialized agents collaborate | Orchestrator-worker most common |
| Multi-agent risk | Cost, coordination, errors | Not automatically better |
| Reliability | Errors compound over steps | Reflect + verify + bound |
Exercises
Exercises 1–10 are pen-and-paper; 11–20 require code.
Further reading: “ReAct: Synergizing Reasoning and Acting” (Yao et al., 2022). “Reflexion: Language Agents with Verbal Reinforcement Learning” (Shinn et al., 2023). “Self-Refine” (Madaan et al., 2023). “Generative Agents” (Park et al., 2023) for agent memory and behaviour. “AutoGen” (Wu et al., 2023) and the LangGraph documentation for multi-agent frameworks. “Toolformer” (Schick et al., 2023) and the SWE-bench / SWE-agent work for coding agents. “Debate” (Du et al., 2023; Irving et al., 2018) for multi-agent debate.
Next → Chapter 35: Open Problems & Future Directions
Across thirty-four chapters you have built a complete understanding of LLMs — from the mathematics of Part I to the autonomous agents of this chapter. But the field is far from finished. The final chapter steps back to survey what we still DON'T know and CAN'T yet do: the unsolved problems of interpretability, alignment, reliability, reasoning, and safety; the deep questions about what these systems are and where they are heading; and the frontiers where the next breakthroughs — perhaps yours — will come. Having learned how LLMs work, we close by honestly confronting their limits and the open horizon beyond them.