RAG Engineering in Production · Part 6

Guide · For: AI Engineers, ML Engineers, Platform Engineers, AI Systems Architects

Why Your Agentic RAG System Costs 10x More Than It Should

Wrapping a well-built RAG pipeline in an agent loop multiplies every cost in the stack by the number of retrieval iterations - and most teams have no budget ceiling, no per-session spend tracking, and no circuit breaker.

#rag #agentic-rag #cost-governance #langgraph #token-budget #observability #production-ai #llm-infrastructure

In November 2025, a market research pipeline running four LangChain agents entered an unintended loop. Two of the agents - an Analyzer and a Verifier - began exchanging requests: the Analyzer generated content, the Verifier asked for further analysis, the Analyzer obliged. Neither agent had a budget ceiling. Neither had a mechanism to terminate the session before the next API call completed. The loop ran for 264 hours. The bill was $47,000. Nobody noticed until it was over.

The post-mortem identified two root causes: no per-agent budget caps, and no enforcement layer between the agent's decision to make another call and the LLM API completing it. The team had observability. They had alerts. The alerts fired. Nobody acted on them in time. Observability without enforcement is a dashboard, not a control.

That incident is an extreme case of the same structural problem every agentic RAG system carries: wrapping a retrieval pipeline in an agent loop multiplies every cost in the stack by the number of iterations the agent decides to take. A single-pass RAG pipeline has a deterministic cost envelope per query. An agent loop does not. The agent decides how many times to retrieve, how many times to rewrite the query, and how many times to evaluate its own output before it is satisfied. Without a governance layer that enforces a ceiling, those decisions have no upper bound in production.

The thesis of this article is direct: agentic RAG is worth the cost premium for the queries that need it. The problem is that most systems apply it uniformly to all queries, have no measurement of where the cost actually goes, and have no enforcement mechanism that activates before the bill arrives.

This is the final piece of the diagnostic framework this series has built. The Retrieval Tax from Part 1, the Chunking Debt from Part 2, the Semantic Compression Loss from Part 3, the Precision Gap from Part 4, the Evals Blind Spot from Part 5 - each was a cost your system paid invisibly at a specific pipeline layer. The Orchestration Overhead, the subject of this part, is what happens when all five of those costs are multiplied by an agent loop with no budget governance.


What the Orchestration Overhead Actually Is

The Orchestration Overhead is not a single cost. It is the product of three compounding components that activate when a well-built single-pass RAG pipeline is wrapped in an agent control loop.

The Loop Tax. Each iteration of the agent loop incurs the full cost of one retrieval pass: embedding the query (or rewritten query), vector search, BM25, hybrid fusion, reranking, and LLM inference over the retrieved context. A single-pass pipeline pays this cost once. An agent that decides to retrieve three times pays it three times. At 40,000 queries per day, moving from an average of 1.3 to 2.6 retrieval iterations per query doubles your retrieval infrastructure bill, with no corresponding improvement for the 60% of queries that did not need the extra pass.
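The arithmetic behind the Loop Tax can be sketched in a few lines. This is a minimal illustration, not a pricing model: the per-pass cost of $0.002 is a placeholder assumption, and `daily_retrieval_cost` is a hypothetical helper name.

```python
def daily_retrieval_cost(
    queries_per_day: int,
    avg_iterations: float,
    cost_per_pass_usd: float,
) -> float:
    """Loop Tax: every iteration pays for one full retrieval pass."""
    return queries_per_day * avg_iterations * cost_per_pass_usd


# Placeholder assumption: embedding + search + rerank + inference for one pass.
COST_PER_PASS_USD = 0.002

baseline = daily_retrieval_cost(40_000, 1.3, COST_PER_PASS_USD)
doubled = daily_retrieval_cost(40_000, 2.6, COST_PER_PASS_USD)

print(f"1.3 iterations/query: ${baseline:,.2f}/day")
print(f"2.6 iterations/query: ${doubled:,.2f}/day")
print(f"ratio: {doubled / baseline:.1f}x")
```

The ratio depends only on the iteration averages, so the doubling holds whatever the real per-pass cost turns out to be.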

Context Accumulation. Each iteration the agent takes grows the context window it carries. After round one, the prompt contains the original query plus the first retrieved context plus the agent's reasoning about whether it is sufficient. After round two, it contains all of that plus the second retrieved context plus the agent's re-evaluation. The input-token cost of each LLM call scales linearly with its context length, and every evaluation call re-sends everything accumulated so far, so cumulative spend grows faster than the iteration count. An agent that takes four retrieval passes and accumulates 8,000 tokens of context by generation is paying 4-6x the inference cost of a single-pass retrieval that assembled the same information in one step.
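A rough way to see the accumulation effect is to count billed input tokens call by call. The sketch below rests on stated assumptions, none of them measurements: 2,000 tokens retrieved per pass, 150 tokens of agent reasoning kept per evaluation, and a 100-token query. Output tokens are ignored. The function names are hypothetical.

```python
def cumulative_input_tokens(
    passes: int,
    tokens_per_retrieval: int = 2_000,
    reasoning_tokens_per_eval: int = 150,
    query_tokens: int = 100,
) -> int:
    """Total input tokens billed across one agent session.

    Each pass appends retrieved context, then an evaluation call re-sends
    everything accumulated so far. The evaluation's reasoning is kept in
    context. A final generation call sends the full context once more.
    """
    context = query_tokens
    billed = 0
    for _ in range(passes):
        context += tokens_per_retrieval       # new documents join the prompt
        billed += context                     # evaluation call re-reads all of it
        context += reasoning_tokens_per_eval  # the verdict stays in context
    billed += context                         # final generation call
    return billed


def single_pass_input_tokens(context_tokens: int, query_tokens: int = 100) -> int:
    """One generation call over the same information, assembled in one step."""
    return query_tokens + context_tokens


agent_tokens = cumulative_input_tokens(passes=4)
single_tokens = single_pass_input_tokens(context_tokens=8_000)
print(f"agent: {agent_tokens} input tokens, single pass: {single_tokens}")
print(f"ratio: {agent_tokens / single_tokens:.1f}x")
```

Under these assumptions the gap comes out a little under 4x on input tokens alone. Generated reasoning is usually billed at a higher output rate, which widens the gap further.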

The Governance Vacuum. The most expensive component is not the loops themselves - it is the absence of a policy layer that enforces a ceiling before costs accumulate. In single-pass RAG, the cost per query is bounded by the pipeline structure: one retrieval call, one generation call. In agentic RAG, the only natural ceiling is the agent's own confidence evaluation. In production, with ambiguous queries, uneven corpus coverage, or edge cases the agent's confidence heuristic was not designed for, that ceiling may never be reached.

Together: Loop Tax × Context Accumulation × Governance Vacuum = the Orchestration Overhead. Teams that add an agent loop without addressing all three components ship a system whose cost in production is structurally unpredictable.


The Wrong Way: Uniform Agentic Retrieval Without Governance

```python
# Wrong way: wrap every query in the agent loop regardless of complexity.
# This pattern ships in most agentic RAG tutorials.
# Assumes `retriever` and `llm` are already configured at module level.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph


class AgentState(TypedDict):
    query: str
    retrieved_docs: Annotated[list[str], operator.add]
    answer: str
    iterations: int  # Declared, but never incremented or checked


def retrieve(state: AgentState) -> dict:
    """Retrieve documents for the current query."""
    docs = retriever.invoke(state["query"])
    return {"retrieved_docs": [d.page_content for d in docs]}


def evaluate_sufficiency(state: AgentState) -> str:
    """Ask the LLM if it has enough information."""
    # This is where the governance vacuum lives.
    # The agent evaluates its own confidence.
    # There is no ceiling on how many times it can say "not enough".
    # There is no token budget tracked across iterations.
    # There is no circuit breaker.
    prompt = f"""Given this context:
{state['retrieved_docs']}

Can you answer: {state['query']}? Reply SUFFICIENT or INSUFFICIENT."""
    result = llm.invoke(prompt).content.strip()
    return "generate" if result == "SUFFICIENT" else "retrieve"


def generate(state: AgentState) -> dict:
    context = "\n\n".join(state["retrieved_docs"])
    answer = llm.invoke(
        f"Context:\n{context}\n\nQuestion: {state['query']}\nAnswer:"
    ).content
    return {"answer": answer}


# Build graph - no iteration limit, no token budget, no circuit breaker
graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("retrieve")
# evaluate_sufficiency returns an edge name, so it is wired as a router.
# It cannot also be a node: nodes must return state updates, not strings.
graph.add_conditional_edges(
    "retrieve",
    evaluate_sufficiency,
    {"retrieve": "retrieve", "generate": "generate"},
)
graph.add_edge("generate", END)

# Every query - simple FAQ, complex multi-hop, everything -
# goes through this loop with no differentiation.
# Simple queries: pay for 1-3 iterations they did not need.
# Edge cases: keep looping until the context window fills or LangGraph's
# recursion limit (25 steps by default) raises GraphRecursionError.
# That limit is a crash guard, not a cost control.
# Token budget per session: undefined.
# Cost per query: unknown until the billing statement arrives.
agent = graph.compile()
```

The Right Way: Budget-Governed Agentic RAG with Intent Routing

The correct architecture has three components: intent classification before the agent loop, hard budget enforcement inside the loop, and per-session spend tracking visible in production.

Component 1: Intent Router - Skip the Loop When You Do Not Need It

Not every query warrants an agent loop. Classify query complexity first and route to the cheapest path that can answer it. A question like "what is our refund policy" does not need three retrieval iterations. It needs one pass. Sending it through an agent loop pays 3-10x more for an answer that was available after the first retrieval.

```python
from enum import Enum

from langchain_openai import ChatOpenAI


class QueryIntent(Enum):
    DIRECT = "direct"    # No retrieval: LLM can answer from knowledge
    SIMPLE = "simple"    # Single-pass: one retrieval round sufficient
    COMPLEX = "complex"  # Agent loop: multi-hop reasoning required


def classify_intent(query: str, fast_llm: ChatOpenAI) -> QueryIntent:
    """
    Classify query complexity using a fast, cheap model.
    Route to the cheapest pipeline that can answer.

    Use a small model here (gpt-4o-mini, claude-haiku) not your primary LLM.
    The classification call cost is ~$0.000015 - trivial vs the savings
    from skipping the agent loop on 60-70% of queries.

    Calibrate against your production query distribution:
    - Target: SIMPLE handles ~60-70% of queries
    - COMPLEX handles ~20-30% (multi-hop, synthesis, comparison)
    - DIRECT handles ~5-10% (greetings, basic facts)
    """
    prompt = """Classify this query for retrieval routing.

Query: {query}

DIRECT: answerable from general knowledge, no documents needed
SIMPLE: requires retrieving from the knowledge base, single lookup sufficient
COMPLEX: requires multiple retrieval steps, sub-query decomposition, or synthesis

Reply with one word: DIRECT, SIMPLE, or COMPLEX""".format(query=query)

    result = fast_llm.invoke(prompt).content.strip().upper()
    try:
        return QueryIntent(result.lower())
    except ValueError:
        return QueryIntent.SIMPLE  # Safe default for unparseable replies


def route_query(query: str, fast_llm, primary_llm, retriever, agent) -> str:
    """
    Route based on intent classification.
    Only COMPLEX queries enter the agent loop.
    SIMPLE queries: single-pass retrieve and generate.
    DIRECT queries: skip retrieval entirely.

    fast_llm: cheap small model for classification (gpt-4o-mini, claude-haiku)
    primary_llm: your main generation model (gpt-4o, claude-sonnet)
    """
    intent = classify_intent(query, fast_llm)

    if intent == QueryIntent.DIRECT:
        return primary_llm.invoke(query).content

    elif intent == QueryIntent.SIMPLE:
        # Single-pass: Part 1-4 pipeline, no agent loop
        docs = retriever.invoke(query)
        context = "\n\n".join([d.page_content for d in docs[:5]])
        return primary_llm.invoke(
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        ).content

    else:  # QueryIntent.COMPLEX
        # Agent loop - with budget governance (see Component 2).
        # Counters are initialized explicitly: the graph nodes read
        # state["iterations"] and state["tokens_used"] on the first pass.
        result = agent.invoke({
            "query": query,
            "retrieved_docs": [],
            "answer": "",
            "iterations": 0,
            "tokens_used": 0,
            "budget_tokens": 8000,
            "terminated_early": False,
            "next_step": "",
        })
        return result["answer"]  # invoke() returns the final state dict
```

Component 2: Hard Budget Enforcement Inside the Loop

Token budget alerts are not enforcement. An alert that fires after the session has consumed 10,000 tokens does not stop the 11th thousand. Enforcement requires evaluating the budget ceiling before each API call and terminating the session if the ceiling is reached.

```python
# Assumes `retriever` and `llm` are already configured at module level.
import operator
from typing import Annotated, TypedDict

import tiktoken
from langgraph.graph import END, StateGraph

MAX_ITERATIONS = 4


class BudgetedAgentState(TypedDict):
    query: str
    retrieved_docs: Annotated[list[str], operator.add]
    answer: str
    iterations: int
    tokens_used: int
    budget_tokens: int      # Hard ceiling set at session start
    terminated_early: bool
    next_step: str          # Set by check_budget_and_evaluate, read by router


def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for budget tracking."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))


def check_budget_and_evaluate(state: BudgetedAgentState) -> dict:
    """
    Node function: evaluates agent confidence and updates state with a routing signal.
    Returns a state update dict - does NOT return the routing string directly.

    Sets state["next_step"], which the router reads to pick the next edge.
    """
    # Hard ceilings - terminate without an LLM call
    if state["tokens_used"] >= state["budget_tokens"]:
        return {"next_step": "generate_with_warning"}
    if state["iterations"] >= MAX_ITERATIONS:
        return {"next_step": "generate_with_warning"}

    # Confidence evaluation over the 10 most recent documents
    recent_context = "\n".join(state["retrieved_docs"][-10:])
    prompt = f"""Context gathered so far:
{recent_context}

Can you answer: {state['query']}?
Reply SUFFICIENT or INSUFFICIENT."""

    tokens_this_call = count_tokens(prompt)

    # Pre-flight: will this call push us over budget?
    # The call is skipped, so its tokens are not charged to the session.
    if state["tokens_used"] + tokens_this_call > state["budget_tokens"]:
        return {"next_step": "generate_with_warning"}

    result = llm.invoke(prompt).content.strip().upper()
    # startswith tolerates trailing punctuation; "INSUFFICIENT" does not match
    next_step = "generate" if result.startswith("SUFFICIENT") else "retrieve"
    return {
        "next_step": next_step,
        "tokens_used": state["tokens_used"] + tokens_this_call,
    }


def route_from_budget_check(state: BudgetedAgentState) -> str:
    """
    Pure routing function: reads state["next_step"] set by the node above.
    Returns the edge key for conditional routing.
    This is separate from the node function - LangGraph requires this separation.
    """
    return state.get("next_step", "retrieve")


def budgeted_retrieve(state: BudgetedAgentState) -> dict:
    """Retrieve and track token spend."""
    docs = retriever.invoke(state["query"])
    doc_texts = [d.page_content for d in docs[:5]]

    # Track tokens consumed by this retrieval round
    tokens_this_round = sum(count_tokens(t) for t in doc_texts)

    return {
        "retrieved_docs": doc_texts,
        "iterations": state["iterations"] + 1,
        "tokens_used": state["tokens_used"] + tokens_this_round,
        "terminated_early": False,
    }


def budgeted_generate(state: BudgetedAgentState) -> dict:
    """Generate from whatever context was assembled within budget."""
    context = "\n\n".join(state["retrieved_docs"])
    answer = llm.invoke(
        f"Context:\n{context}\n\nQuestion: {state['query']}\nAnswer:"
    ).content
    return {"answer": answer, "terminated_early": False}


def generate_with_budget_warning(state: BudgetedAgentState) -> dict:
    """
    Generate from partial context when the budget ceiling is hit.
    Flags the response for downstream monitoring.

    Returning a flagged answer is better than looping indefinitely.
    The flag surfaces in Langfuse/LangSmith traces as a metric
    to monitor: what fraction of agent sessions hit the ceiling?
    If that fraction is high, raise the ceiling or improve retrieval
    so fewer iterations are needed.
    """
    context = "\n\n".join(state["retrieved_docs"])
    answer = llm.invoke(
        f"Context (budget limit reached after {state['iterations']} "
        f"iterations):\n{context}\n\nQuestion: {state['query']}\n"
        f"Answer based on available context:"
    ).content
    return {
        "answer": answer,
        "terminated_early": True,  # Flagged for monitoring
    }


def build_budgeted_agent():
    """
    Assemble the budget-governed agent graph.

    Node separation (LangGraph requirement):
    - check_budget_and_evaluate: state-updating node, sets state["next_step"]
    - route_from_budget_check: pure routing function, reads state["next_step"]

    The routing function is passed to add_conditional_edges, NOT added as a node.
    """
    graph = StateGraph(BudgetedAgentState)

    graph.add_node("retrieve", budgeted_retrieve)
    graph.add_node("check_budget_and_evaluate", check_budget_and_evaluate)
    graph.add_node("generate", budgeted_generate)
    graph.add_node("generate_with_warning", generate_with_budget_warning)

    graph.set_entry_point("retrieve")
    graph.add_edge("retrieve", "check_budget_and_evaluate")

    # Routing function reads state["next_step"] set by the node above
    graph.add_conditional_edges(
        "check_budget_and_evaluate",
        route_from_budget_check,  # Pure router - not a node
        {
            "retrieve": "retrieve",
            "generate": "generate",
            "generate_with_warning": "generate_with_warning",
        },
    )
    graph.add_edge("generate", END)
    graph.add_edge("generate_with_warning", END)

    return graph.compile()
```

Component 3: Per-Session Spend Tracking in Production

Budget enforcement inside the loop prevents runaway sessions. Per-session spend tracking in production tells you whether your budget ceiling is calibrated correctly and where the cost is actually going.

```python
import time
from dataclasses import dataclass

from langfuse import Langfuse

langfuse = Langfuse()


@dataclass
class SessionMetrics:
    session_id: str
    query: str
    intent: str
    iterations: int = 0
    tokens_used: int = 0
    latency_ms: float = 0.0
    terminated_early: bool = False
    cost_usd: float = 0.0


def track_agent_session(
    session_id: str,
    query: str,
    intent: str,
    agent_result: dict,
    start_time: float,
    # Placeholder blended rate in USD per million tokens.
    # Set this from your provider's current pricing.
    token_price_per_million: float = 3.00,
) -> SessionMetrics:
    """
    Capture per-session metrics for cost governance monitoring.

    Key signals to watch in production:

    1. Average iterations per session by intent bucket
       - If COMPLEX queries average 4+ iterations, your retrieval
         quality is poor - the agent is searching repeatedly because
         single-pass results are insufficient. Fix upstream (Parts 1-4).

    2. terminated_early rate
       - If >10% of sessions hit the budget ceiling, your ceiling
         is too tight OR your corpus coverage is too sparse.
         Distinguish the two by checking context_recall (Part 5).

    3. Cost per session by intent
       - SIMPLE sessions should cost 3-10x less than COMPLEX.
         If they do not, your intent router is mis-classifying.

    4. Token accumulation rate per iteration
       - If context grows faster than 1,000 tokens per iteration,
         your retrieval is returning too many documents.
         Reduce top-k or tighten the reranker threshold.
    """
    tokens_used = agent_result.get("tokens_used", 0)
    metrics = SessionMetrics(
        session_id=session_id,
        query=query,
        intent=intent,
        iterations=agent_result.get("iterations", 0),
        tokens_used=tokens_used,
        latency_ms=(time.time() - start_time) * 1000,
        terminated_early=agent_result.get("terminated_early", False),
        cost_usd=tokens_used / 1_000_000 * token_price_per_million,
    )

    # Log to Langfuse for observability.
    # langfuse.trace() is the v2 Python SDK call; later major versions
    # changed the tracing API, so check your installed version.
    langfuse.trace(
        name="agentic_rag_session",
        id=session_id,
        metadata={
            "intent": intent,
            "iterations": metrics.iterations,
            "tokens_used": metrics.tokens_used,
            "terminated_early": metrics.terminated_early,
            "cost_usd": metrics.cost_usd,
            "latency_ms": metrics.latency_ms,
        },
    )

    return metrics
```

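Per-session records become useful once they are rolled up per intent tier. The following is a minimal sketch of that roll-up, assuming records are available in memory. `SessionRecord` is a hypothetical stand-in carrying only the fields needed here, and `summarize_by_intent` is an illustrative name, not part of any tracing library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Hypothetical stand-in with the per-session fields used below."""
    intent: str
    iterations: int
    tokens_used: int
    cost_usd: float
    terminated_early: bool


def summarize_by_intent(records: list[SessionRecord]) -> dict[str, dict[str, float]]:
    """Aggregate per-session records into per-intent averages and rates."""
    buckets: dict[str, list[SessionRecord]] = defaultdict(list)
    for record in records:
        buckets[record.intent].append(record)

    summary: dict[str, dict[str, float]] = {}
    for intent, rows in buckets.items():
        n = len(rows)
        summary[intent] = {
            "sessions": n,
            "avg_iterations": sum(r.iterations for r in rows) / n,
            "avg_cost_usd": sum(r.cost_usd for r in rows) / n,
            # bool sums as 0/1, so this is the fraction that hit a ceiling
            "early_termination_rate": sum(r.terminated_early for r in rows) / n,
        }
    return summary


records = [
    SessionRecord("simple", 1, 1_500, 0.0045, False),
    SessionRecord("complex", 3, 9_000, 0.0270, False),
    SessionRecord("complex", 4, 12_000, 0.0360, True),
]
for intent, stats in summarize_by_intent(records).items():
    print(intent, stats)
```

In a real deployment the same grouping would more likely run as a query in the observability backend. The point is that the thresholds in the docstring above are only checkable once metrics are bucketed by intent.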
The Agentic RAG Cost Governance Diagram

```mermaid
flowchart TD
    Q[User query] --> IC[Intent classifier\nfast cheap model]
    IC -- DIRECT\n~5-10% of queries --> LLM0[LLM only\nno retrieval\n1x cost]
    IC -- SIMPLE\n~60-70% of queries --> SP[Single-pass RAG\nParts 1-4 pipeline\n1x cost]
    IC -- COMPLEX\n~20-30% of queries --> BG{Budget gate\ntokens_used vs ceiling\niterations vs MAX}
    BG -- Under budget\nunder iteration limit --> RT[Retrieve + rerank\nParts 1-4 pipeline]
    RT --> CE{Agent confidence\nevaluation}
    CE -- SUFFICIENT --> GN[Generate\ncollect metrics]
    CE -- INSUFFICIENT --> BG
    BG -- Ceiling hit --> GW[Generate with\nbudget warning\nflagged for monitoring]
    SP --> GN
    LLM0 --> GN
    GW --> MT[Per-session metrics\nLangfuse trace]
    GN --> MT
    MT --> AL{Anomaly\ndetection}
    AL -- iterations avg above 3 --> FX1[Fix upstream retrieval\nParts 1-4]
    AL -- early termination above 10pct --> FX2[Raise budget ceiling\nor fix corpus coverage]
    AL -- intent misclassification --> FX3[Retrain intent router\non production distribution]
    AL -- Healthy --> OK[No action]

    style Q fill:#4A90E2,color:#fff
    style IC fill:#7B68EE,color:#fff
    style LLM0 fill:#6BCF7F,color:#fff
    style SP fill:#4A90E2,color:#fff
    style BG fill:#7B68EE,color:#fff
    style RT fill:#4A90E2,color:#fff
    style CE fill:#9B59B6,color:#fff
    style GN fill:#6BCF7F,color:#fff
    style GW fill:#FFD93D,color:#333
    style MT fill:#98D8C8,color:#333
    style AL fill:#7B68EE,color:#fff
    style FX1 fill:#E74C3C,color:#fff
    style FX2 fill:#FFA07A,color:#333
    style FX3 fill:#FFA07A,color:#333
    style OK fill:#6BCF7F,color:#fff
```

The diagram makes the governance structure explicit. The budget gate is not a retry limiter on a single call - it is a session-level controller that evaluates cumulative spend before authorizing each new retrieval iteration. The anomaly detection layer connects the monitoring output back to the five upstream pipeline layers: high iteration counts signal retrieval quality failures from Parts 1-4; high early-termination rates signal corpus coverage gaps that evaluation from Part 5 would have caught.
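The anomaly detection step can be expressed as a handful of threshold rules. The sketch below uses the thresholds stated in this article (average iterations above 3, early termination above 10%, SIMPLE costing less than 3x below COMPLEX). The function name and signature are hypothetical, and it assumes the aggregates have already been computed elsewhere.

```python
def governance_actions(
    complex_avg_iterations: float,
    early_termination_rate: float,
    simple_avg_cost_usd: float,
    complex_avg_cost_usd: float,
) -> list[str]:
    """Map aggregate session metrics to the escalation paths in the diagram."""
    actions: list[str] = []

    if complex_avg_iterations > 3:
        actions.append("Fix upstream retrieval (Parts 1-4)")

    if early_termination_rate > 0.10:
        actions.append("Raise budget ceiling or fix corpus coverage")

    # SIMPLE should cost 3-10x less than COMPLEX; a ratio under 3
    # suggests the router is sending queries to the wrong tier.
    if simple_avg_cost_usd > 0 and complex_avg_cost_usd / simple_avg_cost_usd < 3:
        actions.append("Retrain intent router on production distribution")

    return actions or ["No action"]


print(governance_actions(2.1, 0.04, 0.004, 0.020))  # healthy profile
print(governance_actions(3.6, 0.15, 0.010, 0.015))  # all three rules fire
```

Treat the output as a prompt for investigation rather than an automated fix: each rule points at a layer to inspect, not at a confirmed cause.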


When Agentic RAG Earns Its Cost Premium

The Orchestration Overhead is not an argument against agentic RAG. It is an argument for applying it selectively and governing it explicitly. For the right query types, the cost premium is justified:

Multi-hop factual synthesis. "Compare the termination clause across our five most recent vendor contracts and flag any that differ from our standard terms." This query requires retrieving from five different document subsets, comparing across them, and synthesizing a finding. Single-pass retrieval cannot do this. Three to four agent iterations is the correct budget.

Ambiguous queries with self-correction. When the initial retrieval returns documents that are relevant in topic but insufficient in specificity, a self-correcting loop that rewrites the query and retrieves again is the right mechanism. The Evals Blind Spot from Part 5 showed that context recall below 0.7 is a retrieval failure - agentic loops are one production mechanism for compensating when first-pass retrieval is insufficient.

High-stakes domains where verification matters. Legal, medical, financial. For these domains, an agent that retrieves, checks its answer against a second retrieval, and flags low-confidence results before returning is worth the 3-5x cost premium over a single-pass pipeline that is confidently wrong.

What does NOT earn the premium:

  • FAQ-style lookups where the answer is in one document
  • Structured queries that should be SQL RAG (Part 1)
  • Queries where the single-pass pipeline already achieves context recall above 0.85 (Part 5 threshold)

The intent router exists to make this distinction automatically and route to the cheapest sufficient path. Without it, you pay the COMPLEX cost on SIMPLE queries.
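A back-of-the-envelope comparison shows what the router is worth. The sketch below takes traffic shares from the ranges quoted earlier (DIRECT 10%, SIMPLE 65%, COMPLEX 25%) and assumes a COMPLEX session costs 5x a single pass, a placeholder inside the 3-10x range. The unrouted case assumes every query pays the COMPLEX multiplier, which is a simplification. All names are hypothetical.

```python
def blended_cost(shares: dict[str, float], multipliers: dict[str, float]) -> float:
    """Expected relative cost per query, in units of one single-pass query."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * multipliers[tier] for tier, share in shares.items())


# Assumed traffic mix, taken from the ranges quoted in this article.
SHARES = {"direct": 0.10, "simple": 0.65, "complex": 0.25}

# Assumed relative costs: 1x for cheap paths, 5x for the agent loop.
ROUTED = {"direct": 1.0, "simple": 1.0, "complex": 5.0}
UNROUTED = {tier: 5.0 for tier in SHARES}  # everything enters the agent loop

routed = blended_cost(SHARES, ROUTED)
unrouted = blended_cost(SHARES, UNROUTED)

print(f"routed:   {routed:.2f}x single-pass cost per query")
print(f"unrouted: {unrouted:.2f}x single-pass cost per query")
print(f"saving:   {1 - routed / unrouted:.0%}")
```

The saving is sensitive to both the traffic mix and the multiplier, so the useful exercise is to re-run it with measured values from production rather than to trust these placeholders.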


The Series Completion: A Unified Diagnostic Framework

Six parts, six named failure modes, one complete diagnostic vocabulary for production RAG systems:

| Layer | Named Concept | What It Costs | How to Measure | Where to Fix |
|---|---|---|---|---|
| Retrieval strategy | The Retrieval Tax | Wrong backend per query type | nDCG@10 before/after routing | Part 1: retrieval strategy decision guide |
| Chunking | Chunking Debt | Irretrievable context at index time | Context recall drops below 0.8 | Part 2: recursive 400-token default |
| Embedding | Semantic Compression Loss | Domain terms mapped to wrong proxies | Recall gap on domain eval set | Part 3: domain model or fine-tuning |
| Reranking | The Precision Gap | Bi-encoder ranks wrong doc at position 1 | nDCG@10 pre vs post reranker | Part 4: cross-encoder on top-50 |
| Evaluation | The Evals Blind Spot | Retrieval failures invisible in prod | Context recall below 0.8; no CI gate | Part 5: RAGAS golden dataset + CI gate |
| Agent governance | The Orchestration Overhead | Loop cost with no ceiling | Avg iterations, tokens/session | Part 6: intent router + budget gate |

These six concepts are not independent. They interact:

  • High Orchestration Overhead is often caused by high Retrieval Tax or Precision Gap - the agent loops because first-pass retrieval is insufficient
  • The Evals Blind Spot is what makes Chunking Debt and Semantic Compression Loss invisible until production incidents expose them
  • The Retrieval Tax sets the floor on Orchestration Overhead: if each retrieval iteration is already using the wrong strategy, multiplying it by three iterations compounds the error

A RAG system with all six layers governed is a system where failures have names, costs have ceilings, and degradation is detected before users report it.


Agentic RAG Cost Governance Checklist

Before enabling the agent loop:

  • Intent router built and calibrated against production query distribution
  • COMPLEX query fraction measured: if above 40%, re-examine classification thresholds
  • Per-session token budget defined per intent tier (COMPLEX ceiling: e.g. 8,000-12,000 tokens)
  • MAX_ITERATIONS ceiling set: 3-4 for most domains; BCAS paper shows diminishing returns beyond this
  • Single-pass pipeline (Parts 1-4) validated at context recall above 0.8 before adding agent loop

Inside the agent loop:

  • Budget gate enforces token ceiling before each retrieval decision - not after
  • Iteration counter enforces MAX_ITERATIONS ceiling
  • terminated_early flag emitted when either ceiling hits: observable in Langfuse/LangSmith
  • Context accumulation rate tracked: if tokens grow faster than 1,000 per iteration, reduce top-k

Production observability:

  • Average iterations per session tracked per intent tier
  • Cost per session tracked and surfaced in dashboards (not just monthly billing)
  • Early termination rate monitored: above 10% triggers ceiling review
  • Intent misclassification rate estimated quarterly: sample complex sessions that resolved in one iteration
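The quarterly misclassification estimate in the last item can start from a simple proxy: COMPLEX-routed sessions that finished in one iteration without hitting a ceiling probably did not need the agent loop. The sketch below assumes sessions are available as plain dicts with the keys shown. The function name is hypothetical, and the proxy only catches over-routing, not COMPLEX queries wrongly sent to the SIMPLE path.

```python
import random


def overrouted_complex_sessions(
    sessions: list[dict],
    sample_size: int = 20,
    seed: int = 0,
) -> tuple[float, list[str]]:
    """Return (over-routing rate, sample of session ids for manual review)."""
    complex_sessions = [s for s in sessions if s["intent"] == "complex"]
    suspects = [
        s for s in complex_sessions
        if s["iterations"] <= 1 and not s["terminated_early"]
    ]
    rate = len(suspects) / len(complex_sessions) if complex_sessions else 0.0

    # Seeded sampling keeps the review set reproducible between runs.
    rng = random.Random(seed)
    review = rng.sample(suspects, min(sample_size, len(suspects)))
    return rate, [s["session_id"] for s in review]


sessions = [
    {"session_id": "a", "intent": "complex", "iterations": 1, "terminated_early": False},
    {"session_id": "b", "intent": "complex", "iterations": 3, "terminated_early": False},
    {"session_id": "c", "intent": "simple", "iterations": 1, "terminated_early": False},
    {"session_id": "d", "intent": "complex", "iterations": 1, "terminated_early": True},
]
rate, to_review = overrouted_complex_sessions(sessions)
print(f"over-routing rate: {rate:.0%}, review: {to_review}")
```

The sampled ids are meant for a human pass: the rate is an estimate, and a one-iteration COMPLEX session can still be a correct classification if the first retrieval happened to be complete.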

Governance escalation:

  • If avg iterations above 3 for COMPLEX queries: diagnose retrieval quality (Parts 1-4 pipeline)
  • If early termination rate above 10%: evaluate whether ceiling is too tight or corpus coverage is insufficient (Part 5: context recall on golden dataset)
  • If cost per SIMPLE session equals cost per COMPLEX: intent router is mis-classifying; retrain on production distribution
  • Post-incident: any unintended loop that runs more than 10 iterations before terminating = circuit breaker gap; add hard kill at session level
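The session-level hard kill in the last item can be sketched as a guard that must authorize every call before it is made. This is a minimal illustration under stated assumptions: `SessionGuard` and `SessionBudgetExceeded` are hypothetical names, the limits are placeholders, and the caller is responsible for invoking `authorize` before each LLM or retrieval call.

```python
import time
from typing import Callable


class SessionBudgetExceeded(RuntimeError):
    """Raised before a call that would break a session-level limit."""


class SessionGuard:
    """Hypothetical session-level kill switch: calls, tokens, wall-clock."""

    def __init__(
        self,
        max_calls: int,
        max_tokens: int,
        max_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.calls = 0
        self.tokens = 0

    def authorize(self, estimated_tokens: int) -> None:
        """Call BEFORE every LLM or retrieval call; raises instead of allowing it."""
        if self.calls >= self.max_calls:
            raise SessionBudgetExceeded(f"call limit {self.max_calls} reached")
        if self.tokens + estimated_tokens > self.max_tokens:
            raise SessionBudgetExceeded(f"token limit {self.max_tokens} would be exceeded")
        if self._clock() - self._started > self.max_seconds:
            raise SessionBudgetExceeded(f"session older than {self.max_seconds}s")
        self.calls += 1
        self.tokens += estimated_tokens


guard = SessionGuard(max_calls=10, max_tokens=12_000, max_seconds=120)
try:
    for _ in range(50):  # stand-in for a runaway agent loop
        guard.authorize(estimated_tokens=1_500)
except SessionBudgetExceeded as exc:
    print(f"session killed after {guard.calls} calls: {exc}")
```

The wall-clock limit is the piece that would have mattered most in a multi-day loop: it fires even when individual calls are cheap and per-call token estimates look harmless.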
