The agent passed every staging test. One model, a clear system prompt, twelve tools, and a loop that called them until the ticket was resolved. It closed support tickets in seconds and the demo got a round of applause.
Three weeks into production it was costing $4,000 a day.
The numbers told a story nobody wanted to read. The p95 latency had tripled. One ticket in twenty came back with a confident, fluent, wrong answer - a refund issued twice, an order marked shipped when it had been cancelled, a policy quoted that the company had retired in 2023. None of it threw an exception. None of it showed up in the error rate.
The team did what every team does first. They upgraded to the most capable model available. The bill went up. The wrong answers stayed.
Then they tried prompt engineering. They added "do not issue duplicate refunds" to the system prompt. It worked in testing and failed in production, because by the time the agent reached the refund decision, that instruction was 14,000 tokens back in a context window stuffed with tool outputs, and the model was no longer attending to it.
The figures in this article - the daily cost above, the accuracy and token tables later - are representative composites drawn from production single-agent systems, not measurements from one published benchmark. The absolute values will differ with your model, tools, and traffic. What is stable, and independently supported by the research cited throughout, is the shape of each curve.
The thesis: these are structural failures, not model-quality failures
Here is the claim this article owns, and it runs against the reflex every team reaches for first: none of those failures are model-quality failures, and a better model does not change the shape of a single one of the five curves. A stronger model can lift a curve's level - nudge tool-selection accuracy up a few points - but it does not bend the slope: accuracy still collapses as tools grow, cost still compounds super-linearly, attention still decays. These are failures of orchestration and state - of asking one agent to hold too much context, choose among too many tools, and track a world that kept changing underneath it.
That is a structural problem, and structural problems need structural fixes. You cannot prompt your way out of an architecture problem, and you cannot pay your way out of one with a frontier model either. Before you can fix it, you have to be able to name it precisely and measure it. This article gives you the vocabulary: the Five-Mode Failure Taxonomy.
One honest caveat up front, because it is the first thing a skeptical staff engineer will raise. The single-agent ceiling is not fixed - it rises as models improve. Recent empirical work (Gao et al., 2025) shows the advantage of multi-agent systems over single-agent ones shrinks as base models get better at long-context reasoning and tool use. This article is not an argument that multi-agent architecture is always the answer. It is an argument that the single-agent loop has specific, nameable structural limits, and that recognizing which limit you have hit is the prerequisite for choosing the right fix - which is sometimes a better single agent, not more agents.
Why the single-agent loop is the right place to start
The single-agent loop is where almost everyone starts, and it works long enough to get into production before it breaks. That is exactly what makes it dangerous.
The pattern is seductive. You give one language model a system prompt, a set of tools, and a loop: call the model, let it pick a tool, run the tool, feed the result back, repeat until the model says it is done. It is the architecture in every framework tutorial. For a narrow task with three tools and short inputs, it is genuinely the correct design.
Here is a position worth stating plainly, because it cuts against the direction this article is pointing: reaching for multi-agent too early is the most common and most expensive architecture mistake in this space. Adding agents multiplies orchestration overhead, debugging surface, and failure modes. The empirical record backs this up - Gao et al. (2025) find the advantage of multi-agent over single-agent shrinks as base models improve, and most teams that split work across agents do so before they have proven their single agent has actually hit a ceiling. Do not let anyone, including this article, talk you out of a single agent when a single agent is enough.
The trouble starts when the task gets real. Real support tickets are long. Real tool sets grow as product managers file requests. Real orders change state while the agent is mid-reasoning, because a human or another service touched the same record. The single-agent loop has no structural answer to any of this. It degrades, and the degradation is invisible until you are measuring the right things.
The wrong way: prompt the symptom
Consider the naive resolution to the duplicate-refund bug from the opening. The obvious fix is a retry guard or a stronger instruction. Here is the version most teams ship first:
# naive_guard.pydef handle_ticket(agent, ticket: str) -> str: # The "fix": tell the model not to repeat itself, and hope it listens. system = ( "You are a support agent. Resolve the ticket. " "Never issue a refund twice for the same order." ) return agent.run(system_prompt=system, user_input=ticket)This fails for a reason that has nothing to do with the model's intelligence. The instruction lives at the top of the context window. By the time the agent has searched the knowledge base, looked up the order, and read three tool results, the effective attention on that early instruction has decayed. The model is not ignoring the rule; the rule is no longer in the part of the context it is reasoning over. This is not folklore - it is the "lost in the middle" effect documented by Liu et al. (2024), where retrieval accuracy drops by more than 30% when the relevant information sits in the middle of a long context rather than at its edges, replicated across every major model family and still present in frontier models.
The deeper issue is that the single agent conflates concerns that production forces apart. It holds conversational state, world state, tool-selection logic, and control flow in one undifferentiated context. When any one of those grows, it crowds out the others. That coupling is the root cause, and it expresses itself in five distinct, measurable ways.
The Five-Mode Failure Taxonomy
Every production single-agent failure I have diagnosed reduces to one of five modes. They are not bugs you fix once. They are pressures that grow with scale, and a single agent has no structural defense against any of them.
Before the taxonomy, one named idea to anchor it:
The Single-Agent Ceiling is the point at which the marginal failure rate from adding capability to one agent - one more tool, more context, another reasoning step - exceeds the marginal competence that capability buys. Below the crossover, each addition makes the agent better; above it, each addition makes the system worse. The ceiling is real but not fixed: a more capable model moves the crossover point higher, which is precisely why "upgrade the model" sometimes works for a while and then stops. Multi-agent architecture is one way to push past the ceiling; a sharper-scoped single agent is another. The taxonomy tells you which.
The five modes are the forces that pull an agent toward its ceiling. Each names a failure, and each has a structural - not prompt-level - response.
| Failure mode | What it is | Why a better model does not fix it |
|---|---|---|
| Tool explosion | Selection accuracy collapses as the tool count grows | More tools means more plausibly-correct wrong choices, regardless of model |
| State drift | The agent acts on a world state that changed mid-run | The model cannot know about a mutation it never observed |
| Context window collapse | Early instructions decay out of effective attention | A larger window delays the collapse; it does not prevent it |
| Latency cascade | Sequential tool hops compound into long-tail latency | Faster tokens do not remove the round-trip count |
| Cost runaway | Token spend grows super-linearly with retries and context | A pricier model makes the same runaway more expensive |
Hold this table in mind. We are about to build a single agent that is good enough to ship, then measure exactly where each of these five forces pushes it past the ceiling.
A note on scope, because there is a well-known taxonomy this one is easily confused with. Cemri et al.'s MAST (Multi-Agent System Failure Taxonomy, UC Berkeley, NeurIPS 2025) catalogs 14 failure modes of multi-agent systems - specification issues, inter-agent misalignment, and task verification - drawn from over 1,600 annotated traces across seven frameworks. The Five-Mode Taxonomy here is its complement, not its competitor. MAST asks why a system of agents fails once you have built one; this taxonomy asks why one agent hits the wall that drives you to build a system in the first place. The five modes are the forces that push you off the single-agent loop; MAST's fourteen are what can break in the multi-agent system you reach for next. Tellingly, the two agree on the deepest point: Cemri et al. find that better base models alone are insufficient to resolve their taxonomy, and conclude that failures stem from system design rather than model limitations - the multi-agent echo of the structural-not-model-quality argument this article makes for the single-agent case.
Building an instrumented single agent
We will build a support-triage agent with LangGraph, instrument it so every failure mode is measurable, and establish a clean module layout. LangGraph reached its 1.0 stable release in October 2025 and its explicit state and graph structure make the failure modes visible in code rather than hidden behind an abstraction. The build is progressive: configuration, then state, then observability, then the world and tools, then the agent graph, then a harness.
The module layout separates the concerns the single-agent loop conflates:
src/├── __init__.py├── config.py # Settings, loaded once from the environment├── state.py # AgentState - the graph's typed state├── observability.py # structlog setup + RunMetrics├── backend.py # the mutable world state the tools act on├── tools.py # the support tools└── agent.py # the compiled LangGraph agentConfiguration loaded once
Settings load once, from the environment, into a typed object. Read configuration through get_settings() and never touch os.environ directly.
# src/config.pyfrom functools import lru_cachefrom pydantic_settings import BaseSettings, SettingsConfigDictclass Settings(BaseSettings): model_config = SettingsConfigDict(env_prefix="PMAS_", env_file=".env") model: str = "claude-sonnet-4-5" # model is config, not a hard-coded string max_tool_hops: int = 8 # the single most important guardrail here request_timeout_s: float = 30.0@lru_cache # process-wide singletondef get_settings() -> Settings: return Settings()The model is configuration, not a hard-coded string. Swapping models is a config change, which is what lets us prove later that model choice does not move the failure curves. The hop limit is the single most important guardrail in this build - without it, a confused agent loops until it exhausts the context window or your budget. The lru_cache makes settings a process-wide singleton, so configuration is read once rather than re-parsed on every node execution.
Typed state
The agent's state is a typed dictionary that LangGraph threads through every node. The add_messages reducer tells LangGraph to append to the message history rather than overwrite it.
# src/state.pyfrom typing import Annotated, TypedDictfrom langchain_core.messages import AnyMessagefrom langgraph.graph.message import add_messagesclass AgentState(TypedDict): messages: Annotated[list[AnyMessage], add_messages] # appends, not overwrites ticket_id: str # ties logs to one run hops: int # agent-tool round tripsThe add_messages reducer appends new messages on each update instead of replacing the list. This is what makes the conversation accumulate - and, as we will see, what makes context window collapse and cost runaway inevitable without intervention. The ticket ID ties every log line and metric back to one run. The hops field counts agent-tool round trips so we can bound the loop against max_tool_hops.
Observability first
You cannot diagnose what you do not measure, and the entire point here is to measure failure. The pattern is structured logging through structlog plus a RunMetrics object that captures per-hop latency and token counts.
# src/observability.pyfrom dataclasses import dataclass, fieldimport structlogstructlog.configure( processors=[ structlog.processors.add_log_level, structlog.processors.TimeStamper(fmt="iso"), structlog.processors.JSONRenderer(), # JSON is non-negotiable in production ])log = structlog.get_logger()@dataclassclass HopRecord: hop: int tool: str | None latency_s: float total_tokens: int@dataclassclass RunMetrics: ticket_id: str records: list[HopRecord] = field(default_factory=list) def record(self, rec: HopRecord) -> None: self.records.append(rec) log.info( "hop", ticket=self.ticket_id, hop=rec.hop, tool=rec.tool, latency_s=round(rec.latency_s, 3), tokens=rec.total_tokens, ) @property def total_tokens(self) -> int: return sum(r.total_tokens for r in self.records) @property def total_latency_s(self) -> float: return sum(r.latency_s for r in self.records)JSON output is non-negotiable for production. Structured logs are queryable; formatted strings are not. This single choice is the difference between answering "what did this agent cost?" in a dashboard versus grepping through text. Emitting one structured event per hop is what makes the cost and latency curves later visible at all - the metric is captured at the moment of the hop, not reconstructed afterward.
The world state and tools
The tools act on a mutable world: an order store. Making that store explicit, in its own module, is deliberate. The single-agent loop hides world state inside tool calls; we pull it into the open so that state drift becomes something we can trigger and observe.
# src/backend.pyfrom dataclasses import dataclass@dataclassclass Order: order_id: str status: str # "shipped", "cancelled", "refunded" amount_cents: int# The mutable world. In production this is a database; the failure mode# is identical, only the latency differs.ORDERS: dict[str, Order] = { "A-1001": Order("A-1001", "shipped", 4999), "A-1002": Order("A-1002", "cancelled", 2500),}The tools are thin wrappers over that world. Each one has a docstring, because the docstring is the only thing the model sees when it decides whether to call the tool. Vague docstrings are the first cause of tool explosion, so write them as carefully as you write the code.
# src/tools.pyfrom langchain_core.tools import toolfrom .backend import ORDERS@tooldef lookup_order(order_id: str) -> str: """Return the current status and amount of an order by its ID.""" order = ORDERS.get(order_id) if order is None: return f"No order found with ID {order_id}." return f"Order {order_id}: status={order.status}, amount={order.amount_cents}c."@tooldef issue_refund(order_id: str, amount_cents: int) -> str: """Refund amount_cents against an order. Only call for shipped orders.""" order = ORDERS.get(order_id) if order is None: return f"No order found with ID {order_id}." order.status = "refunded" # mutates without re-reading return f"Refunded {amount_cents}c for order {order_id}."@tooldef escalate(reason: str) -> str: """Hand the ticket to a human agent with a short reason.""" return f"Escalated to human queue: {reason}"ALL_TOOLS = [lookup_order, issue_refund, escalate]Note the latent bug in issue_refund: it mutates the world but does not re-read it first. This becomes state drift the moment a second actor touches the same order between the lookup and the refund.
The agent graph
Now wire it together. The graph has two nodes - the agent and the tools - and a conditional edge that loops until the model stops requesting tools or we hit the hop limit. This is the single-agent loop in its honest form.
# src/agent.pyimport timefrom langchain_anthropic import ChatAnthropicfrom langchain_core.messages import SystemMessagefrom langgraph.graph import END, START, StateGraphfrom langgraph.prebuilt import ToolNodefrom .config import get_settingsfrom .observability import HopRecord, RunMetricsfrom .state import AgentStatefrom .tools import ALL_TOOLSSYSTEM = SystemMessage( content=( "You are a support triage agent. Resolve the ticket using the " "available tools. Never issue a refund twice for the same order." ))_settings = get_settings()_llm = ChatAnthropic(model=_settings.model, timeout=_settings.request_timeout_s)_llm_with_tools = _llm.bind_tools(ALL_TOOLS)def make_agent_node(metrics: RunMetrics): def agent_node(state: AgentState) -> dict: start = time.perf_counter() response = _llm_with_tools.invoke([SYSTEM, *state["messages"]]) usage = response.usage_metadata or {} # token usage from the response tool = None if response.tool_calls: tool = response.tool_calls[0]["name"] metrics.record( HopRecord( hop=state["hops"] + 1, tool=tool, latency_s=time.perf_counter() - start, total_tokens=usage.get("total_tokens", 0), ) ) return {"messages": [response], "hops": state["hops"] + 1} return agent_nodedef should_continue(state: AgentState) -> str: if state["hops"] >= get_settings().max_tool_hops: # hard bound on the loop return END last = state["messages"][-1] return "tools" if getattr(last, "tool_calls", None) else ENDdef build_agent(metrics: RunMetrics): graph = StateGraph(AgentState) graph.add_node("agent", make_agent_node(metrics)) graph.add_node("tools", ToolNode(ALL_TOOLS)) graph.add_edge(START, "agent") graph.add_conditional_edges("agent", should_continue) graph.add_edge("tools", "agent") return graph.compile()Token usage comes straight from the model response. Capturing it here, per hop, is what turns cost from an end-of-month surprise into a per-ticket metric. The hop limit is checked before every continuation - this one line is the difference between a bounded failure and a runaway loop that empties your account.
The diagram below shows the control flow: the agent node decides, the tools node executes, and the conditional edge closes the loop until the work is done or the hop limit trips.
flowchart LR
START([START]) --> A[agent node]
A -->|tool_calls present| T[tools node]
T --> A
A -->|no tool_calls or hop limit| E([END])
style START fill:#95A5A6,stroke:#5D6D7E,color:#fff
style A fill:#4A90E2,stroke:#2E5C8A,color:#fff
style T fill:#7B68EE,stroke:#4B2E83,color:#fff
style E fill:#6BCF7F,stroke:#3E8E54,color:#fff
Finally, a harness ties it together: create a RunMetrics for the run, build the agent around it, invoke it with the ticket, and read the totals.
# run.pyfrom langchain_core.messages import HumanMessagefrom src.agent import build_agentfrom src.observability import RunMetrics, logdef resolve_ticket(ticket_id: str, body: str) -> RunMetrics: metrics = RunMetrics(ticket_id=ticket_id) app = build_agent(metrics) # metrics injected per run app.invoke( { "messages": [HumanMessage(content=body)], "ticket_id": ticket_id, "hops": 0, } ) log.info( "run_complete", ticket=ticket_id, total_tokens=metrics.total_tokens, total_latency_s=round(metrics.total_latency_s, 3), ) return metricsif __name__ == "__main__": m = resolve_ticket( "T-42", "I was charged twice for order A-1001. Please refund one." ) print(f"resolved in {len(m.records)} hops, {m.total_tokens} tokens")The RunMetrics object is created per run and injected into the graph, so every hop records into the same run. One structured event closes each run with its total cost and latency - those are the per-ticket numbers the measurements below are built from.
This agent works. It resolves straightforward tickets correctly, and against a small tool set with short inputs it is the right design. The rest of the article is about what happens when production stops being small.
Measuring all five failure modes
We now run the agent under the conditions production actually imposes. As noted at the outset, the tables below show the shape of each curve, not benchmarks from a single run - and each shape is independently supported by the literature cited.
Tool explosion: accuracy collapses as tools grow
The first pressure is the tool set. Product managers file requests, and the tool list grows from three to a dozen to thirty. Each tool you add introduces another plausibly-correct wrong choice. The table below shows tool-selection accuracy as the set grows, holding the model and the tickets fixed.
| Tools available | Correct tool selected | Notes |
|---|---|---|
| 3 | 98% | Clean separation between tools |
| 8 | 91% | First overlaps appear (refund vs. partial-refund) |
| 16 | 79% | Model picks plausible-but-wrong tools on ambiguous tickets |
| 30 | 61% | Selection is now the dominant failure source |
The model is not getting dumber. The decision is getting harder, and it is getting harder faster than linearly because overlapping tool descriptions compete for the same intent. This is not a quirk of one setup: empirical studies of tool learning report selection-accuracy losses ranging from 7% to 85% as the tool catalogue grows, and find that with more candidate tools models tend to get "stuck" selecting useless tools rather than stopping. A more capable model shifts every row up a few points and changes nothing about the slope. This is the first reason topology matters, and it is why the build above kept the tool set to three.
The structural fix is not "write better docstrings" - that buys you a few points. It is to stop presenting the full tool surface to one agent at all: retrieval-based tool filtering, meta-tools, or splitting the surface across specialized agents each holding a small, coherent toolset.
State drift: acting on a world that moved
Recall that the refund tool reads nothing before it writes. Now run a realistic sequence. The agent calls lookup_order("A-1001") and sees status=shipped. While the model reasons about whether to refund, a separate process - a human agent, a scheduled job, another agent - refunds the same order. The model then calls issue_refund on a state that no longer exists. The duplicate refund the system prompt forbade happens anyway, because the prohibition was never the problem.
# demonstrate_drift.pyfrom src.backend import ORDERS# Hop 2: agent looked up the order and saw "shipped".assert ORDERS["A-1001"].status == "shipped"# Between hops, another actor refunds it.ORDERS["A-1001"].status = "refunded"# Hop 4: the agent, reasoning from stale context, issues a refund anyway.# issue_refund never re-reads status, so it succeeds. Money leaves twice.No model upgrade prevents this. The agent acted correctly on the information it had; the information was stale. State drift is a state-architecture problem, and the fix lives in the tool, not the prompt. Re-read the world inside the write and refuse the action if it no longer holds - the seed of optimistic-concurrency control:
# src/tools.py - the right way: read-before-write inside the mutation@tooldef issue_refund(order_id: str, amount_cents: int) -> str: """Refund amount_cents against an order. Only call for shipped orders.""" order = ORDERS.get(order_id) if order is None: return f"No order found with ID {order_id}." if order.status != "shipped": # re-read, do not trust prior context return ( f"Refused: order {order_id} is '{order.status}', not 'shipped'. " "No refund issued." ) order.status = "refunded" return f"Refunded {amount_cents}c for order {order_id}."The guard lives at the moment of the write, where the world state is authoritative, rather than relying on what the model read several hops earlier. In production, make the check and the write atomic - run them in a transaction or behind a version check - so two actors cannot both pass the guard at once. The duplicate refund is now structurally impossible regardless of what the context window says, which is the entire point.
Latency cascade: the tail is the sum, not the average
Each agent-tool round trip is a model call plus a tool call. They are sequential by construction: the model cannot pick hop four until hop three returns. A ticket that needs five hops pays five round trips end to end, and the p95 of the whole system is dominated by the tickets with the longest chains, not by the average hop latency.
Take an illustrative single hop averaging 1.2 seconds: a five-hop ticket is not 1.2 seconds, it is six, before any retry. Faster token generation trims each hop a little and leaves the round-trip count untouched. Latency must be budgeted per hop and per chain, not as a single average - and the structural lever is parallelism across independent subtasks, which a single sequential loop cannot offer.
Cost runaway: the bill is the last alarm to fire
Latency you feel in seconds. Wrong answers you catch in QA. Cost accumulates silently across thousands of tickets and surfaces on an invoice weeks later. By then the runaway has been compounding the whole time. This is the metric that alarms last and hurts most.
Two forces multiply token spend super-linearly. First, the add_messages reducer means every hop re-sends the entire growing conversation, so a ten-hop ticket pays for early context ten times over. Second, naive retries on a flaky tool re-send that same swelling context on every attempt. The table below shows tokens per resolved ticket as ticket complexity grows.
| Hops to resolve | Tokens (naive, flat assumption) | Tokens (actual, cumulative context) |
|---|---|---|
| 2 | 3,000 | 4,200 |
| 5 | 7,500 | 18,500 |
| 10 | 15,000 | 61,000 |
The naive column assumes each hop costs the same. The actual column is what you pay, because context accumulates and every hop re-sends all of it. A pricier model multiplies the right-hand column without changing its shape. This re-send mechanic is real enough that LangGraph shipped a DeltaChannel (beta, May 2026) that stores only the incremental delta per checkpoint instead of re-serializing accumulated state - an explicit acknowledgment that naive full-context accumulation is a cost problem worth re-architecting for.
It is worth stating the multiplier honestly in both directions: multi-agent is not free. Anthropic reports its production multi-agent research system uses roughly 15x the tokens of a single chat interaction. Multi-agent architecture trades token cost for capability and parallelism; it is a structural fix for the other four modes, not a way to make tokens cheaper.
Context window collapse: the silent one
The failure with no error is the worst to operate. As the conversation and tool outputs grow, early instructions - the refund policy, the tone guide, the compliance constraint - decay out of the part of the context the model effectively attends to. The agent does not announce that it has forgotten the policy. It simply stops following it, intermittently, on exactly the long tickets where the policy matters most.
This is the "lost in the middle" effect operating in a live loop: the U-shaped attention curve means information buried in the middle of a long context is the most likely to be missed, and Anthropic's framing of "context rot" makes the same point - every token spends a finite attention budget. A larger context window delays the collapse to longer tickets; it never removes it. Chroma's 2025 testing confirmed the degradation persists in frontier models. This is why the fix is to not let one context grow unbounded - to give each concern its own bounded context, which is structurally what multiple agents provide.
Why this is not a framework comparison
You might expect a foundations piece to open with LangGraph versus CrewAI versus AutoGen. It does not, on purpose.
The five failure modes are properties of the single-agent architecture, not of any framework. Tool explosion happens in CrewAI for the same reason it happens in LangGraph: more tools, harder selection. State drift happens in raw function calling for the same reason it happens in a graph: the world changed between read and write. A framework comparison this early teaches you the syntax of the thing you should not be doing yet.
LangGraph is used here because its explicit state and graph structure make the failure modes visible in code rather than hidden behind an abstraction. But the taxonomy is portable. If you are on a different framework, every mode still applies to you, and every structural fix has an equivalent in your stack. Learn the failure modes first. The framework is an implementation detail.
A diagnostic checklist for your own agent
When your production single agent misbehaves, map the symptom to the mode before you reach for a fix:
- Wrong tool chosen on ambiguous inputs, worsening as you add tools - tool explosion. Fix: retrieval-based tool filtering or split the tool surface across specialized agents. Do not just rewrite docstrings.
- Correct-looking action on stale data; duplicate or contradictory writes - state drift. Fix: read-before-write guards and optimistic concurrency in the tools, not a stronger prompt.
- p95 latency dominated by a few long tickets - latency cascade. Fix: budget per hop and per chain; parallelize independent subtasks.
- Invoice climbing faster than ticket volume - cost runaway. Fix: bound context growth, cap hops, deduplicate re-sent context (delta checkpointing where available).
- Intermittent policy violations on long tickets, no error thrown - context window collapse. Fix: bound each context; do not rely on a larger window.
- You have not yet tuned the single agent - you do not know your ceiling. Fix: tighten prompts and tools on a frontier model before adding agents. Multi-agent too early is the most common and most expensive mistake.
The decision rule: if the failing mode is tool explosion, context collapse, or latency cascade and you have already tuned the single agent, you have hit the ceiling and multi-agent topology is the structural answer. If it is state drift or cost runaway, fix it in the tools and state layer first - those have single-agent fixes, and adding agents without fixing them just multiplies the bug.
Closing: name it before you fix it
The agent that cost $4,000 a day was not running a bad model. It was running a good model inside an architecture that conflated four kinds of state into one context and one tool surface. Every fix the team tried first - bigger model, stronger prompt - targeted the wrong layer, because the failures were structural and the fixes were not.
That is the whole argument. The Five-Mode Failure Taxonomy - tool explosion, state drift, context window collapse, latency cascade, cost runaway - is not a list of bugs. It is a diagnostic vocabulary for the Single-Agent Ceiling, and the first thing it tells you is which fixes are wasted effort. A better model can shift the level of a curve but never bends its slope. Naming the mode tells you whether the answer lives in your tools, your state layer, or your topology. Measure first, name second, and only then decide whether you need more agents or just a better-scoped one.
References
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics (TACL). https://arxiv.org/abs/2307.03172
- Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E., & Stoica, I. (2025). Why Do Multi-Agent LLM Systems Fail? NeurIPS 2025 Datasets and Benchmarks Track. arXiv:2503.13657. https://arxiv.org/abs/2503.13657
- Gao, M., Li, Y., Liu, B., Yu, Y., Wang, P., Lin, C.-Y., & Lai, F. (2025). Single-agent or Multi-agent Systems? Why Not Both? arXiv:2505.18286. https://arxiv.org/abs/2505.18286
- What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks (2024). arXiv:2407.03007. https://arxiv.org/abs/2407.03007
- From REST to MCP: An Empirical Study of API Wrapping and Automated Server Generation for LLM Agents (2025). arXiv:2507.16044. https://arxiv.org/abs/2507.16044
- Anthropic (2025). Effective context engineering for AI agents. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Chroma Research (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. https://research.trychroma.com/context-rot
- Anthropic (2025). How we built our multi-agent research system. https://www.anthropic.com/engineering/built-multi-agent-research-system
- LangGraph Official Docs. https://langchain-ai.github.io/langgraph/
- LangChain Changelog (LangGraph 1.0, DeltaChannel). https://docs.langchain.com/oss/python/releases/changelog
Related Articles
- Unified Observability Across Agent Fleets: Building the Control Plane Metric Layer
- Why Your Agentic RAG System Costs 10x More Than It Should
- The 7 GenAI Architectures Every AI Engineer Should Know
- Agent Versioning and Deployment Strategies: Shipping Agent Updates Without Breaking Running Pipelines