Fault Isolation and Circuit Breaking: Stop Retrying LLM Calls Like Microservices

A six-agent research pipeline started throwing 529s at 9:40 AM, the morning after a model upgrade. The retry logic did exactly what the playbook said: exponential backoff, six attempts, wrapped around every agent's model call. By 9:43 the provider had recovered. The pipeline did not recover until 10:15.

The reason it stayed down for thirty-two minutes after the provider was healthy is the entire subject of this article. Every client in the fleet was backing off on the same schedule - 2 seconds, 4, 8, 16 - so they all retried at the same instants, in a synchronized wave that re-saturated the provider the moment it came back. The retries were not recovering from the overload. They were the overload. And because the agents wrapped a tenacity decorator around a model SDK that already retries twice on its own, a single logical call could fan out to a dozen real ones. The bill for that half hour was roughly four times a normal morning, and most of those tokens bought nothing - they were spent retrying requests that had already failed for good.

None of this was a bug. It was the standard resilience playbook, applied faithfully, to calls it was never designed for.

The thesis: the microservice resilience playbook is wrong for LLM calls

Here is the claim this article defends: exponential backoff and circuit breakers, ported straight from the microservice world, are not just suboptimal for LLM calls - they are actively harmful, because LLM failures violate every assumption those patterns were built on. The fix is not to tune the backoff curve. It is to throw out the idea that one retry strategy applies to all failures, and replace it with classification.

The consensus is that resilience is a solved, generic layer: wrap the risky call in retry-with-backoff, add a circuit breaker keyed on error rate, move on. That works for a stateless RPC to a service down the hall. Exponential backoff with jitter is one of the most battle-tested patterns in distributed systems, and it is correct - for the failure model it assumes. Marc Brooker's canonical analysis of backoff assumes faults that are transient, statistically independent across clients, and cheap and fast to retry, against operations that are idempotent. Those four assumptions are load-bearing. LLM calls break all four:

Not transient or independent. A 429 means you hit your own token-per-minute ceiling; a 529 means the provider is shedding load across its entire customer base. Either way the failure is correlated - when one of your agents gets rate-limited, the rest are about to, and so is everyone else's fleet. Independent-failure math does not apply to a shared constraint.
Not cheap or fast. A retry of a 50ms RPC costs 50ms. A retry of an LLM call costs seconds of wall-clock and real money per token. Backoff schedules tuned for milliseconds are nonsense when the base operation takes fifteen seconds and the user-facing budget is thirty.
Not idempotent. A retried TCP segment is free and identical. A retried generation that already streamed three thousand tokens before timing out re-bills the whole thing and may produce a different answer.

When the failure model is wrong, the pattern built on it does the opposite of what you want. That is how a three-minute provider blip becomes a thirty-two-minute outage.
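To make the latency mismatch concrete, here is a minimal sketch that counts how many attempts fit inside a user-facing deadline. The 15-second call and 30-second budget are the figures used above; the function name and the plain doubling schedule starting at one second are illustrative assumptions, not part of any library.

```python
def attempts_within_deadline(call_s: float, deadline_s: float,
                             base_wait_s: float = 1.0, max_attempts: int = 6) -> int:
    """Count how many full attempts complete before the deadline expires.

    Assumes every attempt takes call_s seconds and fails, and that the wait
    after attempt index i is base_wait_s * 2**i (plain exponential, no jitter).
    """
    elapsed, completed = 0.0, 0
    for attempt in range(max_attempts):
        if elapsed + call_s > deadline_s:
            break  # the next attempt could not finish in time
        elapsed += call_s
        completed += 1
        elapsed += base_wait_s * 2 ** attempt  # backoff before the next try
    return completed


rpc_attempts = attempts_within_deadline(0.05, 30.0)   # 50 ms RPC
llm_attempts = attempts_within_deadline(15.0, 30.0)   # 15 s LLM call
print(rpc_attempts, llm_attempts)  # 5 1
```

Under these assumptions the RPC gets five attempts inside thirty seconds and the LLM call gets exactly one, which is why an attempt count copied from an RPC policy says very little about what an LLM caller can actually afford.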

Why DAG-structured agent pipelines amplify the damage

A single agent retrying badly wastes some money. A directed-acyclic-graph (DAG) of agents retrying badly takes down the whole graph, and does it in ways that are hard to see.

Agent pipelines fan out: an orchestrator dispatches work to specialist agents, each of which makes its own model and tool calls. I described how the seams between those agents fail in Multi-Agent Topology Patterns - every topology has a contract that tears first under load. Resilience is where that tear becomes a cascade. Three amplifiers stack up:

Correlated failure across the fan-out. When the orchestrator dispatches eight agents and the provider is rate-limiting, all eight fail at once - same limit, same instant. Eight independent retry loops now hammer the same recovering endpoint in lockstep.

Metastability. This is the mechanism behind the opening story, and it has a name in the literature. Bronson and colleagues, in their HotOS 2021 study of metastable failures, found that retry-induced load amplification was the sustaining effect in more than half of the real-world incidents they analyzed. A metastable failure has a trigger (the provider blip) and a sustaining effect (your retries) that keeps the system pinned in a degraded state long after the trigger is gone. Brooker's framing is sharp: adding retries lowers your day-to-day error rate while making the system far more likely to convert a small outage into a self-sustaining retry storm.

Whole-graph stalls. In a fan-out where the orchestrator waits for all N children, one agent stuck in a six-attempt backoff loop holds the entire pipeline hostage. The other seven finished in eight seconds; the graph blocks for two minutes on the one that should have failed fast and degraded.
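The lockstep effect behind the first two amplifiers can be reproduced in a few lines. The sketch below is a toy simulation, not a model of any real provider: it assumes 100 clients that all fail at time zero, a doubling schedule starting at 2 seconds, and hypothetical helper names. It compares how many retries land in the same one-second window with and without full jitter.

```python
import random


def retry_times(n_clients: int, attempts: int, jitter: bool, seed: int = 0,
                base: float = 2.0, cap: float = 60.0) -> list:
    """Return, per client, the times (seconds after the outage) of each retry."""
    rng = random.Random(seed)
    schedules = []
    for _ in range(n_clients):
        t, times = 0.0, []
        for attempt in range(attempts):
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter draws uniformly from [0, ceiling]; no jitter waits the ceiling.
            t += rng.uniform(0, ceiling) if jitter else ceiling
            times.append(t)
        schedules.append(times)
    return schedules


def peak_concurrency(schedules: list, window_s: float = 1.0) -> int:
    """Largest number of retries that land in any one window_s-second bucket."""
    buckets = {}
    for times in schedules:
        for t in times:
            key = int(t // window_s)
            buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())


lockstep = retry_times(100, 4, jitter=False)
spread = retry_times(100, 4, jitter=True)
print(peak_concurrency(lockstep))  # 100: every client retries at the same instants
print(peak_concurrency(spread))    # lower: the retries are smeared across each window
```

Without jitter every client retries at 2, 6, 14, and 30 seconds, so the recovering provider sees the whole fleet at once four times in a row.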

The lesson from state architecture applies here too: a long-running network has to survive partial failure without corrupting the whole run. Resilience is how you contain a fault to one node instead of letting it propagate across the DAG.
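One way to contain a stuck child, sketched under assumptions: each child in the fan-out gets its own deadline, and a child that misses it yields None instead of blocking its siblings. The run_child coroutine is a stand-in for a real agent, and the names and timings are hypothetical.

```python
import asyncio
from typing import Dict, Optional


async def run_child(name: str, work_s: float) -> str:
    await asyncio.sleep(work_s)  # stand-in for the agent's model and tool calls
    return f"{name}: ok"


async def fan_out(children: Dict[str, float], deadline_s: float) -> Dict[str, Optional[str]]:
    """Run all children concurrently; a child that misses the deadline yields None."""

    async def bounded(name: str, work_s: float) -> Optional[str]:
        try:
            return await asyncio.wait_for(run_child(name, work_s), timeout=deadline_s)
        except asyncio.TimeoutError:
            return None  # degraded: this child is cut loose, siblings are unaffected

    names = list(children)
    results = await asyncio.gather(*(bounded(n, children[n]) for n in names))
    return dict(zip(names, results))


results = asyncio.run(
    fan_out({"fast_a": 0.01, "fast_b": 0.02, "stuck": 5.0}, deadline_s=0.2)
)
print(results)
```

The orchestrator still waits for all N children, but the wait is now bounded by the deadline rather than by the slowest retry loop.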

The wrong way: one retry policy wrapped around every call

Here is the pattern that ships in most agent pipelines. It looks responsible. It is a loaded gun.

code

from tenacity import retry, wait_exponential, stop_after_attempt
import anthropic

client = anthropic.Anthropic()  # SDK already retries twice on its own


@retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(6))
def call_llm(messages: list) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=messages,
    )
    return resp.content[0].text


def enrichment_agent(state: dict) -> dict:
    return {"enrichment": call_llm(build_messages(state))}

Count the ways this misfires the moment the provider has a bad minute:

It retries failures that will never succeed. A context_length_exceeded error comes back as a 400. A safety refusal comes back as a terminal error. A malformed request is a 400. Because @retry has no retry= predicate, it treats every exception as retryable and gives all of them six attempts - waiting 2, 4, 8, 16, 32 seconds between attempts - and every attempt re-sends the full prompt, then fails identically. Where the provider bills for the rejected attempt (a refusal produced after the model has read the input, for example), you pay six times for a guaranteed failure; in every case you pay the latency.
It ignores the answer the provider already gave you. A 429 from Anthropic carries a retry-after header telling you exactly how many seconds until your token bucket refills. wait_exponential discards it and guesses. If the header says retry in 7 seconds, backoff either waits 2 (fails again) or 32 (wastes 25 seconds of the user's budget).
It has no jitter, so the fleet synchronizes. wait_exponential with no jitter produces the square wave that re-overloads a recovering provider - the metastable trap.
The retries stack. The Anthropic and OpenAI Python SDKs default to max_retries=2. Your stop_after_attempt(6) wraps that, so one logical call can become up to eighteen real attempts, and a "hung" agent is often just two retry layers compounding into a multi-minute sleep.
There is no circuit breaker and no fallback. When the provider is genuinely down, every call still pays the full retry tax before failing, and the failure propagates up as an exception that takes the DAG with it.
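The stacking arithmetic is worth writing down once. This is a back-of-the-envelope sketch with hypothetical helper names: it assumes the error is one the SDK itself retries (so each outer attempt becomes 1 + max_retries requests), and it uses the 2, 4, 8, 16, 32 second schedule quoted above, which may differ by tenacity version and multiplier.

```python
def worst_case_requests(outer_attempts: int, sdk_max_retries: int) -> int:
    """Real HTTP requests for one logical call when both layers retry.

    Each outer attempt invokes the SDK once; the SDK makes 1 + max_retries requests.
    """
    return outer_attempts * (1 + sdk_max_retries)


def outer_sleep_total(outer_attempts: int, first_wait_s: float = 2.0,
                      cap_s: float = 60.0) -> float:
    """Seconds spent sleeping in the outer layer alone, on a doubling schedule."""
    # N attempts have N - 1 waits between them.
    waits = [min(cap_s, first_wait_s * 2 ** i) for i in range(outer_attempts - 1)]
    return sum(waits)


print(worst_case_requests(6, 2))  # 18
print(outer_sleep_total(6))       # 62.0
```

Eighteen requests and more than a minute of sleep, before counting the SDK's own backoff or the time the calls themselves take, is what a "hung" agent usually turns out to be.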

The deeper problem is the shape of the code: a single @retry decorator treats every exception identically. It cannot, because the failures are not identical.

The right way: classify the failure, then route it

Not all failures want the same response. A failure that means "you're over your quota" wants a timed wait. A failure that means "the provider is on fire" wants backoff plus a fallback. A failure that means "your request is invalid" wants to fail immediately and never retry. Treating these the same is the root error. The fix is to classify every failure into one of a small number of classes and route each class to its own handler. I call this Failure-Class Routing.

Three classes cover almost every real failure; a fourth category, the sunk-token case, is an economic property that cuts across them:

Transient (your rate limit). A 429 rate_limit_error means you exceeded your own requests-per-minute or tokens-per-minute. It is retryable, but not with a guess - the response carries a retry-after header and anthropic-ratelimit-*-reset timestamps that tell you exactly when your bucket refills. Honor them.
Systemic (provider overload). Anthropic's 529 overloaded_error (a non-standard code - it is provider-wide load shedding, not your fault), and 500/503, mean the provider is struggling for everyone. This is retryable with jittered backoff, but it is also the only class that should trip a circuit breaker and trigger a provider fallback - because it is the only one that means "this provider is currently unavailable."
Terminal (deterministic). 400 invalid_request_error, 401, 403, 413, context_length_exceeded (which arrives as a 400), and content-filter refusals. These will fail identically every time. Retrying them is pure waste - the canonical money-burner. Route them to a fix (a larger-context model for a context-length error, exactly what litellm's context_window_fallbacks does) or to graceful degradation. Never to a retry loop.
Sunk-Token (already billed). Orthogonal to the above: a generation that streamed thousands of tokens and then timed out, or completed and then got filtered, has already cost you. I call a naive retry of one of these a Sunk-Token Retry - you re-pay the entire cost for a fresh attempt. A TCP retransmit is free; this is not. Reconcile what you already received before deciding to replay.

Here is the routing made concrete. The classifier and the resilient call replace the blanket decorator:

code

import random
import time
from enum import Enum

import anthropic
from anthropic import APIStatusError, APIConnectionError

client = anthropic.Anthropic(max_retries=0)  # we own the retry policy, not the SDK


class FailureClass(Enum):
    TRANSIENT = "transient"    # 429: your quota - honor retry-after
    SYSTEMIC = "systemic"      # 529/5xx: provider down - backoff + breaker + fallback
    TERMINAL = "terminal"      # 4xx deterministic - never retry


class CircuitOpen(Exception):
    ...


def classify(err: Exception) -> FailureClass:
    if isinstance(err, APIConnectionError):
        return FailureClass.SYSTEMIC  # no HTTP response at all: treat as provider-side
    status = getattr(err, "status_code", None)
    if status == 429:
        return FailureClass.TRANSIENT
    if status in (500, 502, 503, 504, 529):
        return FailureClass.SYSTEMIC
    return FailureClass.TERMINAL  # 400/401/403/413/422/context-length/content-filter


def retry_after(err: Exception) -> float | None:
    resp = getattr(err, "response", None)
    # Compare against None explicitly rather than relying on the response's truthiness.
    hdr = resp.headers.get("retry-after") if resp is not None else None
    try:
        return float(hdr) if hdr else None
    except ValueError:
        return None  # non-numeric header: caller falls back to jittered backoff


def full_jitter(attempt: int, base: float = 1.0, cap: float = 20.0) -> float:
    return random.uniform(0, min(cap, base * 2 ** attempt))  # AWS full jitter

code

class ProviderBreaker:
    """One breaker per provider. Only SYSTEMIC failures count toward it."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return (time.monotonic() - self.opened_at) >= self.cooldown  # half-open trial

    def record_failure(self, cls: FailureClass) -> None:
        if cls is not FailureClass.SYSTEMIC:
            return  # a 400 is your bug; a 429 is your quota. Neither means "provider down."
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures, self.opened_at = 0, None


def call_resilient(breaker: ProviderBreaker, *, messages: list,
                   deadline_s: float, max_attempts: int = 4) -> str:
    if not breaker.allow():
        raise CircuitOpen("provider breaker open")
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            resp = client.messages.create(
                model="claude-sonnet-4-6", max_tokens=1024, messages=messages)
            breaker.record_success()
            return resp.content[0].text
        except (APIStatusError, APIConnectionError) as err:
            cls = classify(err)
            breaker.record_failure(cls)
            if cls is FailureClass.TERMINAL:
                raise  # deterministic: retrying re-bills for a guaranteed failure
            remaining = deadline_s - (time.monotonic() - start)
            wait = retry_after(err) if cls is FailureClass.TRANSIENT else None
            wait = wait if wait is not None else full_jitter(attempt)
            if attempt == max_attempts - 1 or wait >= remaining:
                raise  # never sleep past the caller's deadline
            time.sleep(wait)
    raise RuntimeError("unreachable")

Now the agent wraps this in a partial-failure contract instead of letting an exception escape into the DAG:

code

BREAKERS = {"anthropic": ProviderBreaker()}


def enrichment_agent(state: dict) -> dict:
    try:
        text = call_resilient(BREAKERS["anthropic"],
                              messages=build_messages(state),
                              deadline_s=state["deadline_s"])
        return {"enrichment": text, "degraded": False}
    except (CircuitOpen, APIStatusError, APIConnectionError, RuntimeError):
        # graceful degradation: the pipeline's explicit partial-failure contract.
        # This agent's failure is contained here - the DAG still completes.
        return {"enrichment": None, "degraded": True}

Every difference from the wrong way is deliberate. Terminal errors raise immediately and never re-bill. Transient errors sleep for exactly as long as the provider asked. Systemic errors use full jitter so the fleet does not synchronize, and they are the only class that advances the breaker toward open. The SDK's own retries are disabled (max_retries=0) so layers cannot stack. Every sleep is checked against the caller's deadline on top of the attempt cap, so an agent in backoff can never hold the graph hostage. And when all else fails, the agent degrades to a valid partial result rather than throwing.

Failure-Class Routing and the Sunk-Token Retry

The two named ideas are worth stating plainly because they are what you carry into your next code review.

Failure-Class Routing is the discipline of mapping each failure to a class before choosing a response, and giving each class its own handler. The anti-pattern it replaces is the single @retry (or single circuit breaker) that sees one undifferentiated stream of exceptions. The classification does not have to be elaborate - three classes and one economic flag handle the vast majority of production failures - but it has to exist. The most important routing rule is the one most systems get wrong: only systemic failures may trip the provider circuit breaker. A breaker that opens because you sent a malformed request, or because you hit your own quota, is a breaker that takes itself offline for reasons that have nothing to do with provider health.
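The routing rule for the breaker can be exercised without a provider. The sketch below is a hypothetical variant of a per-provider breaker, not the article's ProviderBreaker: it takes an injectable clock so it can be tested, counts only the systemic status codes used by classify() above, and additionally admits a single probe when half-open. Connection errors carry no status code and would need their own mapping.

```python
import time
from typing import Callable, Optional

SYSTEMIC = {500, 502, 503, 504, 529}  # same status set as classify() above


class ProbeBreaker:
    """Counts only systemic failures; admits one probe at a time when half-open."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.probe_in_flight = False

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                      # closed: traffic flows
        if self.clock() - self.opened_at < self.cooldown_s:
            return False                     # open: fail fast
        if self.probe_in_flight:
            return False                     # half-open: a probe is already out
        self.probe_in_flight = True
        return True                          # half-open: admit exactly one probe

    def record(self, status: Optional[int]) -> None:
        """status is the HTTP status of a finished call, or None for success."""
        self.probe_in_flight = False         # any finished call ends the probe
        if status is None:
            self.failures, self.opened_at = 0, None
        elif status in SYSTEMIC:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
        # any other status (400, 429, ...) is deliberately ignored


now = [0.0]                                  # fake clock for the demonstration
b = ProbeBreaker(threshold=3, cooldown_s=30.0, clock=lambda: now[0])
for status in (429, 400, 429, 400):
    b.record(status)
after_client_errors = b.allow()              # True: quota and request errors never open it
for _ in range(3):
    b.record(529)
after_overload = b.allow()                   # False: three systemic failures opened it
now[0] = 31.0
first_probe, second_probe = b.allow(), b.allow()  # True, False: one probe only
b.record(None)
after_recovery = b.allow()                   # True: the probe succeeded, breaker closed
print(after_client_errors, after_overload, first_probe, second_probe, after_recovery)
```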

The Sunk-Token Retry is the economic reason blind retry is dangerous for LLMs specifically. In the systems backoff was designed for, a failed attempt is free to repeat - the whole point of TCP retransmission is that the cost of a dropped packet is near zero. An LLM call inverts that. A request that fails after the model has done work - a long generation that timed out, a streamed response aborted midway, a completion that was produced and then refused by a content filter - has already consumed billable tokens. Retrying it does not resume; it re-bills from the start. This is why, as one analysis of retry budgets puts it, a 20-percent per-step retry rate in a multi-agent pipeline does not add 20 percent to your bill - it compounds across steps toward something closer to a doubling. (The exact multiplier depends on your provider's billing for partial and aborted requests, which is worth verifying against current pricing - but the direction is not in question.) Backoff math optimizes for time. For LLMs you are also optimizing for money, and the two budgets run out at different rates. Call it the two-budget rule: every LLM retry spends a latency budget and a money budget, and a resilience policy that bounds only one of them - the universal habit of capping retries by attempt count alone - is half a policy. The right way above bounds latency with a deadline; bounding money means tracking sunk tokens and refusing the retry whose expected cost exceeds what the answer is worth.
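A minimal sketch of the two-budget rule follows, assuming the caller supplies token counts and prices. RetryBudget and its fields are hypothetical names, and the per-token prices are illustrative placeholders rather than any provider's current pricing.

```python
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Tracks both budgets for one logical call. All prices are caller-supplied."""
    deadline_s: float
    max_spend_usd: float
    elapsed_s: float = 0.0
    spent_usd: float = 0.0

    def record_attempt(self, seconds: float, input_tokens: int, output_tokens: int,
                       usd_per_input_token: float, usd_per_output_token: float) -> None:
        """Charge a finished or aborted attempt, including tokens that bought nothing."""
        self.elapsed_s += seconds
        self.spent_usd += (input_tokens * usd_per_input_token
                           + output_tokens * usd_per_output_token)

    def may_retry(self, est_seconds: float, est_cost_usd: float) -> bool:
        """Allow a retry only if it fits inside both remaining budgets."""
        fits_time = self.elapsed_s + est_seconds <= self.deadline_s
        fits_money = self.spent_usd + est_cost_usd <= self.max_spend_usd
        return fits_time and fits_money


budget = RetryBudget(deadline_s=30.0, max_spend_usd=0.05)
# A generation that streamed 3,000 output tokens and then timed out.
budget.record_attempt(seconds=12.0, input_tokens=2_000, output_tokens=3_000,
                      usd_per_input_token=3e-6, usd_per_output_token=15e-6)
print(round(budget.spent_usd, 6))
print(budget.may_retry(est_seconds=12.0, est_cost_usd=0.051))
```

In this example the retry still fits the latency budget (24 of 30 seconds) but not the money budget, which is exactly the case an attempt counter cannot see.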

How the failure classes actually behave at the provider

The classification is not arbitrary - it maps directly to how the providers behave, and the details matter enough that hard-coding the wrong assumption is its own failure mode.

A 429 from Anthropic is governed by a token bucket: your limits (requests per minute, input tokens per minute, output tokens per minute, enforced per model class) refill continuously rather than resetting on a fixed clock. The response headers expose the full state - anthropic-ratelimit-requests-remaining, anthropic-ratelimit-input-tokens-remaining, anthropic-ratelimit-tokens-reset, and the decisive retry-after. Honoring retry-after is not politeness; it is the difference between retrying when capacity actually exists and retrying into a wall. Note one subtlety the headers encode: for most models only uncached input tokens count toward your input-token limit, so prompt caching changes your effective rate limit, not just your cost.
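Reading the header defensively is cheap. The sketch below is an assumption-laden helper, not the SDK's own parser: HTTP allows Retry-After to be either a number of seconds or an HTTP-date, so it handles both and returns None for anything it cannot parse. It expects a plain mapping with lowercase keys; a case-insensitive header object from an HTTP client would work the same way.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def parse_retry_after(headers: Mapping[str, str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is absent or unparseable."""
    raw = headers.get("retry-after")
    if raw is None:
        return None
    raw = raw.strip()
    try:
        return max(0.0, float(raw))          # delta-seconds form, e.g. "7"
    except ValueError:
        pass
    try:
        when = parsedate_to_datetime(raw)    # HTTP-date form
    except (TypeError, ValueError):
        return None                          # neither form: let the caller back off
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


fixed_now = datetime(2026, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(parse_retry_after({"retry-after": "7"}))
print(parse_retry_after({"retry-after": "Thu, 01 Jan 2026 00:00:10 GMT"}, now=fixed_now))
print(parse_retry_after({}))
```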

A 529 is a different animal and deserves a different reflex. It is overloaded_error - the provider shedding load across all customers. No retry-after value you honor will fix a capacity problem you do not own. The correct response is broad jittered backoff and a fallback path to another model or provider, gated by the circuit breaker. Conflating 429 and 529 - treating "I'm over quota" and "the provider is overloaded" as the same retryable event - is one of the most common mistakes, and it is why the classifier separates them.

Terminal errors are the ones worth being ruthless about. context_length_exceeded returns as a 400; no number of retries shrinks your prompt. A safety refusal is a property of the request, not a transient fault. These belong in front of a router that does something useful - swap to a larger-context model, summarize and resubmit, or degrade - never in a backoff loop. litellm's router encodes exactly this instinct with context_window_fallbacks firing on ContextWindowExceededError while ordinary transient errors go to num_retries and repeated failures trigger a deployment cooldown. That is Failure-Class Routing implemented at the gateway, and it is the closest thing to an LLM-aware circuit breaker available off the shelf.
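A router for terminal failures can be very small. The following is a sketch under stated assumptions: the action names are hypothetical, and detecting a context-length failure by matching substrings in the error message is a stand-in for whatever structured error type your provider or gateway actually exposes.

```python
from enum import Enum


class TerminalAction(Enum):
    LARGER_CONTEXT_FALLBACK = "larger_context_fallback"  # swap model, or summarize and resubmit
    ALERT_AND_STOP = "alert_and_stop"                    # credentials or permissions: page a human
    DEGRADE = "degrade"                                  # return a typed partial result


# Substrings that suggest a context-length failure. Hypothetical: check real error bodies.
CONTEXT_MARKERS = ("context_length_exceeded", "context window", "prompt is too long")


def route_terminal(status: int, message: str = "") -> TerminalAction:
    """Pick a non-retry response for a failure already classified as terminal."""
    text = message.lower()
    if status in (401, 403):
        return TerminalAction.ALERT_AND_STOP
    if status == 400 and any(marker in text for marker in CONTEXT_MARKERS):
        return TerminalAction.LARGER_CONTEXT_FALLBACK
    return TerminalAction.DEGRADE


print(route_terminal(400, "context_length_exceeded: reduce the prompt"))
print(route_terminal(401, "invalid x-api-key"))
print(route_terminal(400, "messages: field required"))
```

None of the three branches leads back to the same call with the same input, which is the whole point of the class.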

mermaid

flowchart TD
    Call[Agent makes LLM call]:::primary --> Err{Call failed?}:::decision
    Err -->|No| Ok[Return result]:::success
    Err -->|Yes| Classify[Classify the failure]:::primary

    Classify --> Terminal[Terminal: 4xx / context-length / filter]:::err
    Classify --> Transient[Transient: 429 your quota]:::warn
    Classify --> Systemic[Systemic: 529 / 5xx provider]:::alt

    Terminal --> FailFast[Fail fast - never retry.<br/>Fix, fall back, or degrade]:::err
    Transient --> Honor[Sleep retry-after,<br/>bounded by deadline]:::warn
    Systemic --> Breaker{Breaker open?}:::decision
    Breaker -->|Yes| Degrade[Graceful degradation:<br/>fallback model or partial result]:::secondary
    Breaker -->|No| Jitter[Full-jitter backoff,<br/>record failure, retry in budget]:::alt

    Honor --> Call
    Jitter --> Call
    FailFast --> Degrade

    classDef primary fill:#4A90E2,stroke:#3A7BC8,color:#FFFFFF;
    classDef secondary fill:#98D8C8,stroke:#6FB8A8,color:#2C2C2A;
    classDef decision fill:#7B68EE,stroke:#6858DE,color:#FFFFFF;
    classDef warn fill:#FFD93D,stroke:#E0BC2D,color:#2C2C2A;
    classDef alt fill:#FFA07A,stroke:#E08763,color:#2C2C2A;
    classDef success fill:#6BCF7F,stroke:#54B868,color:#2C2C2A;
    classDef err fill:#E74C3C,stroke:#C0392B,color:#FFFFFF;

Isolating the fault: bulkheads and the partial-failure contract

Classification decides how to treat one failed call. Fault isolation decides what happens to the rest of the DAG when a call fails for good. Two patterns from Michael Nygard's Release It! carry over directly, with one adjustment each for the agent setting.

The bulkhead isolates resources so one compartment flooding does not sink the ship. In an agent pipeline, the resource to compartmentalize is concurrency and token budget per agent. If your enrichment agent is allowed to consume the entire connection pool or the whole per-minute token budget while it thrashes on retries, it starves the agents that were going to succeed. Give each agent (or each provider) its own bounded budget and its own breaker, so a fault stays in its lane. This is also why the breaker is keyed per provider in the code above, not global - an Anthropic outage should not open the breaker protecting your OpenAI fallback.
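The concurrency half of a bulkhead is a semaphore per agent. This sketch uses asyncio and hypothetical names (Bulkhead, fake_call); it caps in-flight calls only and leaves the per-agent token budget, which needs accounting against the provider's rate-limit headers, out of scope.

```python
import asyncio


class Bulkhead:
    """Caps in-flight calls for one agent so its retries cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, kept for inspection

    async def run(self, coro_fn, *args):
        async with self._sem:  # callers beyond the cap wait here
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_fn(*args)
            finally:
                self.in_flight -= 1


async def fake_call(i: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for a model call
    return i


async def main():
    enrichment = Bulkhead(max_concurrent=2)  # this agent may hold at most two slots
    results = await asyncio.gather(*(enrichment.run(fake_call, i) for i in range(10)))
    return results, enrichment.peak


results, peak = asyncio.run(main())
print(results, peak)
```

Ten calls are submitted, but no more than two are ever in flight, so a thrashing enrichment agent cannot take slots that belong to its siblings.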

The partial-failure contract is the agent-specific half. A microservice circuit breaker fails fast by throwing; an agent that throws into a fan-out orchestrator just relocates the cascade. The fix is to make degradation an explicit, typed outcome - the degraded: True flag in the example - that downstream agents are contractually required to handle. This connects directly to the Contract Surface from Multi-Agent Topology Patterns: "what does this agent return when it could not do its job" is one of the load-bearing contracts, and a pipeline without it cannot degrade, only crash. A synthesis agent that knows enrichment may be missing can write a slightly thinner report; a synthesis agent that assumes enrichment is always present produces confident garbage or dies.
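The contract can be made explicit with a small result type. This is an illustrative sketch with hypothetical names (AgentResult, synthesis_agent): the downstream agent has to look at the degraded flag before it can reach the value, so a missing enrichment produces a thinner report rather than a crash.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentResult:
    """What an agent returns, including when it could not do its job."""
    value: Optional[str]
    degraded: bool
    reason: Optional[str] = None


def synthesis_agent(summary: str, enrichment: AgentResult) -> str:
    """Write the report with enrichment when present, and say so when it is not."""
    if enrichment.degraded or enrichment.value is None:
        why = enrichment.reason or "unknown"
        return f"{summary}\n[Enrichment unavailable ({why}); report covers core findings only.]"
    return f"{summary}\n{enrichment.value}"


ok = AgentResult(value="Competitor pricing rose 4% in Q2.", degraded=False)
missing = AgentResult(value=None, degraded=True, reason="provider breaker open")
print(synthesis_agent("Core findings.", ok))
print(synthesis_agent("Core findings.", missing))
```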

LangGraph gives you a useful primitive here, but it is retry, not breaking. Its RetryPolicy (defaults: initial_interval=0.5, backoff_factor=2.0, max_attempts=3, jitter=True) attaches per node, and crucially its default retry_on already declines to retry ValueError, TypeError, and SyntaxError - a primitive form of the terminal/transient split. More importantly for isolation, when a node raises under a checkpointer, LangGraph writes the error to pending writes, does not advance the checkpoint, and does not re-run the siblings that already succeeded in that superstep. That is genuine graph-level fault isolation - the successful work in a fan-out is preserved across a resume. But RetryPolicy will not read a retry-after header, will not open a circuit, and will not fall back to another provider. Those you build, at the agent boundary, with the classification above. Treat RetryPolicy as the transport-level safety net and Failure-Class Routing as the policy that decides whether a retry should happen at all.

Decision guide: resilience for LLM agent pipelines

Work through these before your pipeline meets a bad provider minute.

Disable the SDK's automatic retries (max_retries=0) if you implement your own. Stacked retry layers are the cause of most "it just hung for ten minutes" reports. Own the policy in one place.
Classify every failure before you react. Three classes - transient (429), systemic (529/5xx), terminal (deterministic 4xx) - plus the sunk-token flag. If you cannot classify it, treat it as terminal and fail fast; a wrong retry costs money.
Never retry terminal errors. context_length_exceeded, content filters, 400s, and auth errors will fail identically. Route them to a fix or to degradation, never to a loop.
Honor retry-after and reset headers for 429s. The provider already told you when capacity returns. Guessing with exponential backoff is strictly worse.
Use full jitter for systemic backoff. No jitter means a synchronized fleet re-overloads the recovering provider - the metastable trap. random(0, min(cap, base * 2^attempt)).
Bound retries by a wall-clock deadline, not an attempt count. A fixed six attempts of a fifteen-second call can blow a thirty-second budget. Pass the deadline down and stop when it is gone.
Trip the circuit breaker only on systemic failures, and key it per provider. A breaker that opens on your 400s or your 429s takes itself down for the wrong reasons; a global breaker lets one provider's outage disable your fallback.
Make degradation an explicit contract, not an exception. Return a typed "degraded" result the rest of the DAG is built to handle. Bulkhead each agent's budget so one thrashing agent cannot starve the graph.
Budget money, not just time. Track the sunk-token cost of retries. A retry that re-bills a long generation is a different decision than one that does not.
Reach for a gateway (litellm, Portkey) when you want this out of process. Centralized cooldowns, context-window fallbacks, and multi-provider breakers are real and battle-tested - just verify their defaults (litellm's cooldown is one second; tune it).

Exponential backoff is not wrong. It is a precise tool for transient, independent, cheap, idempotent failures, and it is excellent at that job. LLM calls are simply not that job. The moment you stop treating "an LLM call failed" as one event and start treating it as four different events with four different costs, your pipeline stops amplifying outages and starts containing them. Classify first. Retry last.

References

Anthropic. Rate limits. https://platform.claude.com/docs/en/api/rate-limits (accessed 2026-06-16)
Anthropic. Errors. https://platform.claude.com/docs/en/api/errors (accessed 2026-06-16)
Brooker, M. Exponential Backoff And Jitter. AWS Architecture Blog. https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
Amazon Builders' Library. Timeouts, retries, and backoff with jitter. https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
Brooker, M. (2021). Metastability and Distributed Systems. https://brooker.co.za/blog/2021/05/24/metastable.html
Bronson, N., Aghayev, A., Charapko, A., Zhu, T. (2021). Metastable Failures in Distributed Systems. HotOS '21. https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf
Dean, J. & Barroso, L.A. (2013). The Tail at Scale. Communications of the ACM 56(2), 74-80. https://cacm.acm.org/research/the-tail-at-scale/
Nygard, M. Release It! (2nd ed.). Pragmatic Bookshelf. https://pragprog.com/titles/mnee2/
Fowler, M. CircuitBreaker. https://martinfowler.com/bliki/CircuitBreaker.html
Netflix. Hystrix (maintenance mode). https://github.com/Netflix/Hystrix
LangChain. LangGraph RetryPolicy. https://reference.langchain.com/python/langgraph/types/RetryPolicy
LangChain. LangGraph error handling and retry policies (DeepWiki). https://deepwiki.com/langchain-ai/langgraph/3.8-error-handling-and-retry-policies
litellm. Router - fallbacks, retries, cooldowns. https://docs.litellm.ai/docs/routing
Portkey. Circuit Breaker. https://portkey.ai/docs/product/ai-gateway/circuit-breaker
tenacity. Documentation. https://tenacity.readthedocs.io
Pan, T. (2026). Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill. https://tianpan.co/blog/2026-04-16-retry-budget-llm-agent-cost-amplification
