
MCP vs RAG vs Tools: When to Use Each (and When Not To)

llm-architecture · production-ai · system-design
#mcp #rag #tool-calling #llm-orchestration #agentic-systems #production-llm #ai-architecture #context-management #model-capabilities

The Problem: Everyone's Building the Same System Three Different Ways

Walk into any AI engineering team right now and ask how they're handling external data access for their LLM. You'll get three different answers from three different engineers, all of whom think the others are doing it wrong.

One team has built an elaborate RAG pipeline with a vector database, embedding models, and reranking stages. Another team is using function calling with dozens of custom tools. The newest engineer just discovered Model Context Protocol and wants to rewrite everything.

They're all solving the same problem: getting an LLM to interact with data and systems it wasn't trained on. But they're using fundamentally different architectural patterns, and most teams can't articulate why they chose one over the others.

The real problem isn't that these approaches exist. It's that teams treat them as interchangeable. They're not. Each one makes different assumptions about latency tolerance, data freshness requirements, control flow, and failure modes. Pick wrong, and you'll spend months refactoring a system that never quite worked.

I've seen teams build RAG systems for what should have been simple tool calls. I've watched engineers implement MCP servers for use cases that needed deterministic APIs. The worst part? These mistakes don't fail obviously. They fail slowly, quietly, with increasing operational complexity until someone finally asks: "Why does our AI feature cost $50k/month and still hallucinate?"

This article explains when to use each approach and, more importantly, when not to. No hand-waving, no "it depends" without specifics. Just the architectural decision tree that took me two years of production deployments to build.

The Mental Model: Three Different Control Flow Patterns

Before diving into implementation details, you need to understand that RAG, Tools, and MCP are not three flavors of the same thing. They represent three fundamentally different control flow patterns.

RAG is preprocessing. You frontload all data retrieval before the model generates anything. The model never decides what to fetch or when. It receives context upfront and generates from that. Control flow is linear: retrieve, augment, generate.

Tools are imperative execution. The model requests specific actions during generation, waits for responses, and continues. It's a synchronous request-response loop. The model orchestrates, but execution happens outside. Control flow is iterative: generate, tool call, wait, continue generating.

MCP is declarative capability exposure. You define available resources and operations upfront via a protocol. The model can query what's available and request access during generation. Control flow is negotiated: the model discovers capabilities, requests resources, receives responses based on protocol rules.
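A structural sketch makes the contrast concrete. Everything below is a hypothetical placeholder supplied by you, not any specific library's API; it only shows who decides what, and when.

code
# Structural sketch only: `retrieve`, `generate`, `generate_step`, `call_tool`,
# and `mcp_client` are hypothetical callables, not a real library's API.

def rag_flow(query, retrieve, generate):
    # Linear: retrieve -> augment -> generate. The model never chooses what to fetch.
    context = retrieve(query)
    return generate(f"{context}\n\n{query}")

def tool_flow(query, generate_step, call_tool, max_steps=5):
    # Iterative: the model decides mid-generation to act, waits for the result, continues.
    result = None
    for _ in range(max_steps):
        step = generate_step(query, tool_result=result)  # returns either text or a tool request
        if step.get("text") is not None:
            return step["text"]
        result = call_tool(step["tool"], step["arguments"])
    return None

def mcp_flow(query, mcp_client, generate):
    # Negotiated: capabilities are discovered at runtime, then requested per protocol.
    capabilities = mcp_client.list_capabilities()        # discovery step
    request = generate(query, capabilities=capabilities)  # model picks what to ask for
    response = mcp_client.request(request)               # server decides whether to fulfil it
    return generate(query, context=response)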

The key distinction most teams miss: RAG optimizes for context relevance, Tools optimize for action execution, MCP optimizes for dynamic capability discovery.

When you use RAG, you're betting that the right information can be identified before generation. When you use Tools, you're betting the model can decide which actions to take during generation. When you use MCP, you're betting that capability discovery should happen at runtime rather than being hardcoded.

These are different bets with different payoffs.

Understanding this matters because picking the wrong pattern creates friction everywhere. RAG systems struggle when you can't predict what context is needed upfront. Tool-based systems become fragile when the model must orchestrate complex multi-step procedures. MCP adds overhead when capabilities are static and known in advance.

The decision isn't "which is best?" It's "which control flow matches my problem?"

Architecture: How Each Pattern Actually Works

RAG Architecture

Figure: RAG Architecture

In RAG, retrieval happens first and completely. The model never sees the query alone. It only sees: query + retrieved context. This is the core invariant.

Key architectural properties:

State lives in the vector database. Your retrieval quality determines output quality. If embeddings are stale, outputs are wrong. If chunking is poor, relevant context gets missed.

Decisions happen at retrieval time. The embedding model + similarity search determines what the LLM sees. The LLM doesn't participate in this decision. It can't say "no, fetch different documents."

Failure modes are invisible. If retrieval returns poor results, the model generates anyway. It has no mechanism to reject bad context or request better information.

Cost scales with context size, not query complexity. Whether the query is simple or complex, you pay for the tokens in your retrieved chunks. A 10-word question with 10,000 tokens of context costs more than a 100-word question with 1,000 tokens of context.
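The asymmetry is easy to see with a back-of-the-envelope calculation. The per-token price below is a made-up placeholder, not any provider's pricing; only the ratio matters.

code
# Hypothetical input price, purely for illustration.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # dollars

def input_cost(query_tokens: int, context_tokens: int) -> float:
    """Cost of one request is driven by total input tokens, not query complexity."""
    return (query_tokens + context_tokens) / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(input_cost(query_tokens=10, context_tokens=10_000))  # ~0.100: short question, big context
print(input_cost(query_tokens=100, context_tokens=1_000))  # ~0.011: longer question, small context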

Tool-Based Architecture

Figure: Tool-Based Architecture

In tool-based systems, the model is the orchestrator. It decides when to call tools, which tools to call, and with what parameters. Execution is gated by model decisions.

Key architectural properties:

State lives in function definitions. Tools are explicitly declared with schemas. The model must know about a tool to call it. Adding new tools means updating model context or fine-tuning.

Decisions happen during generation. The model actively chooses when to stop generating text and start calling tools. It can chain multiple tools, retry on failure, or decide tools aren't needed.

Failure modes are explicit. If a tool call fails, the model sees the error. It can retry, call a different tool, or explain the failure to the user.

Cost scales with tool complexity. Each tool call adds latency (request-response roundtrip) and tokens (tool definition + parameters + response). Complex workflows with 10+ tool calls become expensive quickly.

MCP Architecture

Figure: MCP Architecture

MCP sits between tools and RAG. Resources and operations are exposed through standardized protocol messages. The model can discover what's available and request access, but servers control authorization and execution.

Key architectural properties:

State lives in protocol negotiations. MCP servers advertise capabilities. The LLM client queries what's available, requests resources, and receives responses according to protocol rules.

Decisions are distributed. The model decides what to request. The server decides whether to fulfill it. This is fundamentally different from tools, where the model's decision is authoritative.

Failure modes are negotiable. If a resource isn't available, the server can suggest alternatives, require different parameters, or explain constraints. This is richer than tool failures.

Cost scales with protocol overhead. Every capability discovery, resource request, and response flows through MCP message serialization. For simple cases, this is pure overhead. For complex dynamic systems, it pays off.

Implementation: When Each Pattern Actually Works

Use RAG When: Questions Have Predictable Information Needs

RAG works well when you can reliably identify relevant context before generation. This is narrower than most teams assume.

Good RAG use cases:

Customer support where questions map cleanly to documentation sections. "How do I reset my password?" clearly needs the password reset docs. "What's your refund policy?" needs the refund policy section.

Internal knowledge bases with well-structured content. If your wiki has clear topic boundaries and good metadata, semantic search works reliably.

Domain-specific Q&A where the corpus is bounded and relatively static. Legal contract analysis, medical literature search, compliance documentation.

Bad RAG use cases:

Questions requiring multi-step reasoning across disconnected facts. "Which customers should we prioritize based on LTV, recent support tickets, and product usage?" can't be answered by retrieving a few documents.

Real-time data needs. If information changes frequently (prices, inventory, user state), your vector embeddings are always stale.

Procedural tasks. "Deploy this application to production" isn't answered by retrieving deployment docs. It requires executing commands.

Implementation Pattern: RAG Pipeline

code
from typing import List, Dict
from dataclasses import dataclass


class InsufficientContextError(Exception):
    """Raised when retrieval cannot produce context above the similarity threshold."""


@dataclass
class RetrievalConfig:
    top_k: int = 5
    similarity_threshold: float = 0.7
    max_context_tokens: int = 4000
    rerank: bool = True


class RAGPipeline:
    def __init__(
        self,
        embedding_model,
        vector_store,
        llm_client,
        config: RetrievalConfig
    ):
        self.embedding_model = embedding_model
        self.vector_store = vector_store
        self.llm = llm_client
        self.config = config

    def retrieve(self, query: str) -> List[Dict]:
        """
        Critical: This must return relevant chunks or fail visibly.
        Silent irrelevance is the death of RAG systems.
        """
        query_embedding = self.embedding_model.encode(query)

        results = self.vector_store.search(
            query_embedding,
            top_k=self.config.top_k
        )

        # Filter by similarity threshold
        filtered = [
            r for r in results
            if r['similarity'] >= self.config.similarity_threshold
        ]

        if not filtered:
            # Don't generate from empty context
            # This is where most RAG systems fail silently
            raise InsufficientContextError(
                f"No results above threshold {self.config.similarity_threshold}"
            )

        if self.config.rerank:
            # Reranking is expensive but often necessary
            # Embedding similarity != relevance for complex queries
            filtered = self._rerank(query, filtered)

        # _rerank and _assemble_context are corpus-specific helpers (implementations omitted)
        return self._assemble_context(filtered)

    def generate(self, query: str, context: List[Dict]) -> str:
        """
        Construct prompt with explicit context boundaries.
        The model must know where context ends and generation begins.
        """
        context_text = "\n\n".join([
            f"Document {i+1}:\n{doc['content']}"
            for i, doc in enumerate(context)
        ])

        prompt = f"""Answer the question using only the provided context.
If the context doesn't contain enough information, say so explicitly.

Context:
{context_text}

Question: {query}

Answer:"""

        return self.llm.generate(
            prompt,
            max_tokens=500,
            temperature=0.1  # Lower temperature for factual accuracy
        )

    def run(self, query: str) -> Dict:
        """
        Full pipeline with explicit error handling.
        Every failure mode must be observable.
        """
        try:
            context = self.retrieve(query)
            answer = self.generate(query, context)

            return {
                "answer": answer,
                "sources": [doc['metadata'] for doc in context],
                "confidence": self._estimate_confidence(query, context)
            }
        except InsufficientContextError as e:
            return {
                "answer": None,
                "error": "insufficient_context",
                "message": str(e)
            }
        except Exception as e:
            # Don't let generation errors look like answers
            return {
                "answer": None,
                "error": "generation_failed",
                "message": str(e)
            }


# Usage
config = RetrievalConfig(
    top_k=5,
    similarity_threshold=0.75,
    max_context_tokens=3000,
    rerank=True
)

pipeline = RAGPipeline(embedding_model, vector_store, llm, config)
result = pipeline.run("How do I reset my password?")

Production considerations for RAG:

Chunk size matters more than people think. Too small and context is fragmented. Too large and irrelevant content dilutes signal. Start with 512 tokens, measure retrieval quality, adjust.
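As a concrete starting point, here is a minimal sliding-window chunker. The 512-token size comes from the suggestion above; the 64-token overlap and whitespace tokenization are simplifying assumptions, and in practice you would count tokens with your embedding model's tokenizer.

code
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Split text into overlapping chunks.
    Whitespace splitting stands in for a real tokenizer."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks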

Similarity thresholds prevent silent failures. If you don't filter low-similarity results, you feed garbage to the model. If your threshold is too high, you reject valid context. This requires empirical tuning per corpus.

Reranking is almost always worth it. Embedding similarity is a proxy for relevance, not relevance itself. Cross-encoders cost more but catch what bi-encoders miss.
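One way to fill in a `_rerank` step like the placeholder in the pipeline above is a cross-encoder from the sentence-transformers library. A sketch, assuming each retrieved item carries its text under a `content` key:

code
from sentence_transformers import CrossEncoder

# ms-marco cross-encoders are a common choice; any query-document relevance
# model with a .predict([(query, doc), ...]) interface fits here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, results: list, keep: int = 5) -> list:
    """Re-score retrieved chunks with a cross-encoder and keep the top ones."""
    pairs = [(query, r["content"]) for r in results]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
    return [r for r, _ in ranked[:keep]]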

Context token limits force trade-offs. You can retrieve 50 chunks but only pass 10 to the model. How do you choose? This is where teams often introduce subtle bugs.
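One defensible way to make that choice explicit: keep chunks in relevance order until the token budget is spent, and drop the rest. A sketch, with the tokenizer passed in rather than assumed:

code
def select_within_budget(ranked_chunks: list, max_context_tokens: int, count_tokens) -> list:
    """Greedily keep the highest-ranked chunks that fit in the token budget.
    Assumes ranked_chunks is already sorted best-first; count_tokens is your tokenizer."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk["content"])
        if used + cost > max_context_tokens:
            continue  # or break, if you prefer strict top-down truncation
        selected.append(chunk)
        used += cost
    return selected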

Use Tools When: Actions Are Discrete and Parameterizable

Tools work when you can define clear boundaries around operations and the model can decide which operations to invoke.

Good tool use cases:

Database queries where parameters come from user intent. "Show me sales from last quarter" → query_sales(start_date="2023-10-01", end_date="2023-12-31")

API calls with clear input/output contracts. "Get weather for San Francisco" → get_weather(location="San Francisco, CA")

Calculations the model shouldn't attempt. "What's 15% of $1,234.56?" → calculate(expression="0.15 * 1234.56")

Multi-step workflows where intermediate results inform next steps. Book a flight, reserve hotel, send confirmation email.

Bad tool use cases:

Exploratory analysis requiring dozens of iterative queries. The model becomes a slow, expensive query planner.

Real-time streaming data. Tool calls are synchronous request-response. They don't handle continuous updates.

Operations requiring complex authorization logic. If tool availability depends on runtime context, MCP is better.

Implementation Pattern: Tool-Augmented LLM

code
from typing import List, Dict, Callable, Any
from dataclasses import dataclass
import json


@dataclass
class Tool:
    name: str
    description: str
    parameters: Dict[str, Any]  # JSON Schema
    function: Callable


class ToolExecutor:
    def __init__(self, llm_client):
        self.llm = llm_client
        self.tools: Dict[str, Tool] = {}

    def register_tool(self, tool: Tool):
        """
        Register a tool with the executor.
        Each tool must have a complete JSON schema.
        """
        self.tools[tool.name] = tool

    def _create_tool_prompt(self, query: str) -> str:
        """
        Critical: Tool definitions must be precise.
        Ambiguous descriptions lead to incorrect tool calls.
        """
        tool_descriptions = "\n".join([
            f"{name}: {tool.description}\nParameters: {json.dumps(tool.parameters, indent=2)}"
            for name, tool in self.tools.items()
        ])

        return f"""You have access to these tools:

{tool_descriptions}

To use a tool, respond with JSON:
{{"tool": "tool_name", "parameters": {{"param": "value"}}}}

If no tool is needed, respond normally.

User query: {query}"""

    def _is_tool_call(self, response: str) -> bool:
        """A tool call is a bare JSON object with a 'tool' key."""
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and "tool" in parsed

    def _parse_tool_call(self, response: str) -> Dict:
        return json.loads(response)

    def execute(self, query: str, max_iterations: int = 5) -> Dict:
        """
        Agentic loop: generate → tool call → incorporate → repeat

        Max iterations prevents infinite loops when model gets confused.
        """
        conversation_history = []

        for iteration in range(max_iterations):
            prompt = self._create_tool_prompt(query)

            # Add history for multi-turn tool use
            if conversation_history:
                prompt += "\n\nPrevious interactions:\n" + "\n".join(conversation_history)

            response = self.llm.generate(prompt, temperature=0)

            # Check if response is a tool call
            if self._is_tool_call(response):
                tool_call = self._parse_tool_call(response)

                # Execute tool
                try:
                    tool = self.tools[tool_call['tool']]
                    result = tool.function(**tool_call.get('parameters', {}))

                    conversation_history.append(
                        f"Tool: {tool_call['tool']}\nResult: {result}"
                    )

                    # Continue generation with tool result
                    continue

                except KeyError:
                    return {
                        "response": None,
                        "error": f"Unknown tool: {tool_call['tool']}",
                        "iterations": iteration + 1
                    }
                except Exception as e:
                    # Tool execution failed - model should see this
                    conversation_history.append(
                        f"Tool: {tool_call['tool']}\nError: {str(e)}"
                    )
                    continue
            else:
                # No tool call - this is the final response
                return {
                    "response": response,
                    "tool_calls": len(conversation_history),
                    "iterations": iteration + 1
                }

        # Hit max iterations - likely model confusion
        return {
            "response": None,
            "error": "max_iterations_exceeded",
            "iterations": max_iterations
        }


# Define tools
def query_database(table: str, filters: Dict) -> List[Dict]:
    """Execute database query - actual implementation omitted"""
    # In production: validate filters, handle SQL injection, timeout queries
    pass


def send_email(to: str, subject: str, body: str) -> bool:
    """Send email - actual implementation omitted"""
    # In production: validate addresses, rate limit, handle failures
    pass


# Register tools
executor = ToolExecutor(llm_client)

executor.register_tool(Tool(
    name="query_database",
    description="Query the sales database for customer or transaction data",
    parameters={
        "type": "object",
        "properties": {
            "table": {"type": "string", "enum": ["customers", "transactions", "products"]},
            "filters": {"type": "object"}
        },
        "required": ["table"]
    },
    function=query_database
))

executor.register_tool(Tool(
    name="send_email",
    description="Send an email to a user",
    parameters={
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"}
        },
        "required": ["to", "subject", "body"]
    },
    function=send_email
))

# Execute with tools
result = executor.execute("Email customers who spent over $1000 last month")

Production considerations for tools:

Tool schemas must be complete. Ambiguous parameter descriptions cause incorrect tool calls. I've debugged systems where "date" meant ISO-8601 in one tool and Unix timestamp in another. The model can't figure this out reliably.
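The fix for that particular ambiguity is to pin the format in the schema itself instead of hoping the model guesses; for example:

code
# Make the expected format explicit in both the type annotation and the description.
date_parameter = {
    "type": "string",
    "format": "date",  # JSON Schema: full-date per RFC 3339, e.g. "2024-03-01"
    "description": "ISO-8601 calendar date (YYYY-MM-DD). Do not use Unix timestamps."
}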

Iteration limits prevent runaway costs. If the model gets confused and keeps calling tools without making progress, you need a circuit breaker. Five iterations catches most real work without allowing infinite loops.

Tool errors must be informative. Don't just return "Error". Tell the model what went wrong so it can retry with different parameters or explain the failure to users.

Latency compounds. Each tool call adds 200ms-2s of roundtrip time. For workflows requiring 10+ tool calls, this becomes user-perceptible delay.

Use MCP When: Capabilities Need Runtime Discovery

MCP shines when the set of available operations isn't known upfront or depends on runtime context.

Good MCP use cases:

Multi-tenant systems where each tenant has different data sources and permissions. User A can access Salesforce and GitHub. User B only gets Jira.

Dynamic integration platforms where available APIs change based on what's connected. If a user hasn't authorized Slack, don't show Slack operations.

Development environments where filesystem access, database connections, and CLI tools vary per workspace.

Bad MCP use cases:

Simple, static tool sets. If you have 5 tools that never change, MCP is overhead for overhead's sake.

Ultra-low latency requirements. MCP's protocol negotiation adds roundtrips.

Environments without persistent MCP servers. If you're spinning up servers per request, the startup cost dominates.

Implementation Pattern: MCP Integration

code
from typing import List, Dict
import asyncio
import json
from dataclasses import dataclass


@dataclass
class MCPResource:
    uri: str
    name: str
    description: str
    mime_type: str


@dataclass
class MCPTool:
    name: str
    description: str
    input_schema: Dict


class MCPClient:
    """
    MCP client that discovers and uses capabilities from MCP servers.
    Follows the Model Context Protocol specification.
    """

    def __init__(self, server_url: str):
        self.server_url = server_url
        self.available_resources: List[MCPResource] = []
        self.available_tools: List[MCPTool] = []

    async def _send_request(self, message: Dict) -> Dict:
        """Send a JSON-RPC message to the server - transport implementation omitted."""
        raise NotImplementedError

    async def initialize(self):
        """
        Discover what the server offers.
        This is what makes MCP different from static tools.
        """
        # List available resources
        resources_response = await self._send_request({
            "jsonrpc": "2.0",
            "method": "resources/list",
            "id": 1
        })

        self.available_resources = [
            MCPResource(**r) for r in resources_response.get("resources", [])
        ]

        # List available tools
        tools_response = await self._send_request({
            "jsonrpc": "2.0",
            "method": "tools/list",
            "id": 2
        })

        self.available_tools = [
            MCPTool(**t) for t in tools_response.get("tools", [])
        ]

    async def read_resource(self, uri: str) -> Dict:
        """
        Request access to a specific resource.
        The server may deny, redirect, or require additional auth.
        """
        response = await self._send_request({
            "jsonrpc": "2.0",
            "method": "resources/read",
            "params": {"uri": uri},
            "id": 3
        })

        return response

    async def call_tool(self, name: str, arguments: Dict) -> Dict:
        """
        Invoke a tool through MCP.
        The server controls execution and can impose constraints.
        """
        response = await self._send_request({
            "jsonrpc": "2.0",
            "method": "tools/call",
            "params": {
                "name": name,
                "arguments": arguments
            },
            "id": 4
        })

        return response

    def get_capabilities_for_llm(self) -> str:
        """
        Format available capabilities for LLM context.
        This replaces static tool definitions.
        """
        resources = "\n".join([
            f"Resource: {r.name} ({r.uri})\n  {r.description}"
            for r in self.available_resources
        ])

        tools = "\n".join([
            f"Tool: {t.name}\n  {t.description}\n  Schema: {t.input_schema}"
            for t in self.available_tools
        ])

        return f"Available Resources:\n{resources}\n\nAvailable Tools:\n{tools}"


class MCPOrchestrator:
    """
    Orchestrate LLM interaction with multiple MCP servers.
    """

    def __init__(self, llm_client):
        self.llm = llm_client
        self.mcp_clients: Dict[str, MCPClient] = {}

    async def add_server(self, name: str, url: str):
        """Add and initialize an MCP server"""
        client = MCPClient(url)
        await client.initialize()
        self.mcp_clients[name] = client

    def _is_mcp_action(self, response: str) -> bool:
        """An MCP action is a bare JSON object with an 'action' key."""
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and "action" in parsed

    def _parse_action(self, response: str) -> Dict:
        return json.loads(response)

    async def execute(self, query: str) -> Dict:
        """
        Execute query using dynamically discovered MCP capabilities.
        """
        # Gather all capabilities from all servers
        all_capabilities = "\n\n".join([
            f"Server: {name}\n{client.get_capabilities_for_llm()}"
            for name, client in self.mcp_clients.items()
        ])

        prompt = f"""You have access to these dynamically available capabilities:

{all_capabilities}

User query: {query}

If you need to access a resource or call a tool, respond with JSON:
{{"action": "resource|tool", "server": "server_name", "target": "uri or tool_name", "arguments": {{}}}}"""

        response = self.llm.generate(prompt, temperature=0)

        if self._is_mcp_action(response):
            action = self._parse_action(response)

            client = self.mcp_clients[action['server']]

            if action['action'] == 'resource':
                result = await client.read_resource(action['target'])
            else:
                result = await client.call_tool(
                    action['target'],
                    action.get('arguments', {})
                )

            # Generate final response incorporating MCP result
            final_prompt = f"{prompt}\n\nResult: {result}\n\nProvide final answer:"
            return {
                "response": self.llm.generate(final_prompt),
                "mcp_calls": 1
            }

        return {"response": response, "mcp_calls": 0}


# Usage
async def main():
    orchestrator = MCPOrchestrator(llm_client)

    # These servers might expose different capabilities per user
    await orchestrator.add_server("filesystem", "http://localhost:3001")
    await orchestrator.add_server("database", "http://localhost:3002")
    await orchestrator.add_server("github", "http://localhost:3003")

    return await orchestrator.execute(
        "Show me files modified in the last week and any related GitHub PRs"
    )

result = asyncio.run(main())

Production considerations for MCP:

Server discovery adds latency. Querying capabilities from multiple servers before every request is expensive. Cache capability lists and refresh periodically.
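A minimal sketch of that caching, assuming the `MCPClient` above and a fixed refresh interval; a production version would also honor server change notifications.

code
import time

class CachedCapabilities:
    """Cache a server's capability listing and refresh it on a fixed interval."""

    def __init__(self, client, ttl_seconds: float = 300.0):
        self.client = client
        self.ttl = ttl_seconds
        self._cached_text = None
        self._fetched_at = 0.0

    async def get(self) -> str:
        stale = time.monotonic() - self._fetched_at > self.ttl
        if self._cached_text is None or stale:
            await self.client.initialize()  # re-run resources/list and tools/list
            self._cached_text = self.client.get_capabilities_for_llm()
            self._fetched_at = time.monotonic()
        return self._cached_text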

Authorization complexity increases. With static tools, you control what's available. With MCP, servers expose capabilities dynamically, and you need consistent auth across servers.

Protocol versioning matters. MCP is evolving. Ensure client-server compatibility or handle version mismatches gracefully.

Debugging becomes harder. With tools, you see exactly what was called. With MCP, you need to trace protocol negotiations across multiple servers.

Pitfalls & Failure Modes

RAG Pitfalls

Silent irrelevance. Retrieval returns documents that don't answer the question, but the model generates anyway. The output looks confident but is wrong.

Detection: Monitor similarity scores. If they're consistently below 0.75, retrieval quality is poor.

Prevention: Set hard similarity thresholds. Return "I don't have enough information" rather than hallucinating.

Chunking boundary problems. The answer spans multiple chunks, but retrieval only gets one. The model works with incomplete information.

Detection: Track questions where retrieved chunks have high similarity but low answer quality.

Prevention: Implement overlapping chunks. Use parent-child document relationships. Consider hierarchical retrieval.

Context dilution. You retrieve 10 chunks, but only 2 are relevant. The other 8 confuse the model.

Detection: Compare answer quality with top-1 retrieval vs. top-5 vs. top-10.

Prevention: Aggressive filtering. Reranking. Don't pass everything to the model just because you retrieved it.

Embedding drift. Your documents change, but embeddings don't update. Retrieval returns outdated content.

Detection: Monitor embedding freshness metrics. Track documents modified since last embedding update.

Prevention: Implement continuous re-embedding pipelines. Use change detection to prioritize re-embedding.
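A sketch of the change-detection half: hash document content, compare against the hash recorded at embedding time, and only re-embed what actually changed. The storage layout here is hypothetical.

code
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def documents_needing_reembedding(documents: dict, stored_hashes: dict) -> list:
    """documents: {doc_id: text}; stored_hashes: {doc_id: hash recorded at embed time}.
    Returns the doc_ids whose content changed or was never embedded."""
    stale = []
    for doc_id, text in documents.items():
        if stored_hashes.get(doc_id) != content_hash(text):
            stale.append(doc_id)
    return stale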

Tool Pitfalls

Infinite tool loops. Model calls tool, result is ambiguous, model calls same tool with slightly different parameters, repeat forever.

Detection: Monitor iteration counts. If >5 tool calls per query is common, you have a problem.

Prevention: Hard iteration limits. Better tool error messages. Consider if the task is too complex for tool-based execution.

Parameter hallucination. Model invents plausible-sounding parameters that don't match your schema.

Detection: Track tool call validation failures. If >10% of calls have invalid parameters, your schema is unclear.

Prevention: Verbose parameter descriptions with examples. Use enums for constrained values. Return specific error messages when validation fails.
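Validating arguments against the tool's schema before executing, and feeding the validator's messages back to the model, catches most parameter hallucination. A sketch using the `jsonschema` package:

code
from jsonschema import Draft7Validator

def validate_tool_call(arguments: dict, schema: dict) -> list:
    """Return human-readable validation errors (empty list if the call is valid).
    Feed these back to the model verbatim so it can correct the parameters."""
    validator = Draft7Validator(schema)
    return [
        f"{'/'.join(str(p) for p in error.path) or '<root>'}: {error.message}"
        for error in validator.iter_errors(arguments)
    ]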

Tool sprawl. You have 50 tools, the model uses 5 of them, context is wasted on 45 irrelevant definitions.

Detection: Track tool usage frequency. If 80% of tools are called <1% of the time, you have sprawl.

Prevention: Dynamic tool filtering based on query analysis. Don't pass all tools to all queries.
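A common way to do that filtering is to embed tool descriptions once and pass only the tools closest to the query. A sketch; `embed` is any text-to-vector function you supply, and `top_n` is a knob to tune.

code
import numpy as np

def filter_tools(query: str, tools: dict, tool_embeddings: dict, embed, top_n: int = 8) -> dict:
    """Keep only the tools whose description embeddings are closest to the query.
    tools: {name: Tool}; tool_embeddings: {name: precomputed description vector}."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for name, vec in tool_embeddings.items():
        v = vec / np.linalg.norm(vec)
        scored.append((float(q @ v), name))
    keep = {name for _, name in sorted(scored, reverse=True)[:top_n]}
    return {name: tool for name, tool in tools.items() if name in keep}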

Cost explosion from retries. Tool call fails, model retries with different parameters, fails again, keeps trying.

Detection: Monitor total tokens per query. Sudden spikes indicate retry storms.

Prevention: Exponential backoff. Limit retries per tool. Give the model explicit instructions to give up after N attempts.
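A per-tool retry budget is the simplest circuit breaker here; a sketch meant to be wired into the executor loop from earlier. When `allow()` returns False, return a terminal error message to the model instead of executing the call again.

code
from collections import Counter

class RetryBudget:
    """Track tool attempts within a single query and cut off retry storms."""

    def __init__(self, max_attempts_per_tool: int = 3):
        self.max_attempts = max_attempts_per_tool
        self.attempts = Counter()

    def allow(self, tool_name: str) -> bool:
        """Record an attempt; False means the budget for this tool is exhausted."""
        self.attempts[tool_name] += 1
        return self.attempts[tool_name] <= self.max_attempts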

MCP Pitfalls

Server discovery overhead. Querying capabilities from 10 servers before every request adds seconds of latency.

Detection: Measure time-to-first-response. If >50% of latency is pre-generation, discovery is too expensive.

Prevention: Cache capability lists. Refresh only when servers signal changes or on fixed intervals.

Protocol version hell. Different MCP servers implement different protocol versions. Client can't negotiate correctly.

Detection: Protocol handshake failures. Missing expected fields in responses.

Prevention: Implement version negotiation. Gracefully degrade when servers don't support required features.

Authorization fragmentation. Each MCP server has its own auth model. Managing credentials and permissions becomes a nightmare.

Detection: Auth failures across multiple servers. Inconsistent permission models.

Prevention: Centralized credential management. Standardize on OAuth or similar across all servers.

Debugging blackbox. Protocol abstracts away what's actually happening. When things fail, you can't see why.

Detection: High error rates with no clear patterns.

Prevention: Comprehensive logging at protocol layer. Trace IDs across client-server interactions. Observability tooling specifically for MCP.

When to Combine Patterns (And When Not To)

Some problems legitimately require multiple patterns. Most don't.

RAG + Tools makes sense when you need both context retrieval and action execution. Example: "Analyze recent customer feedback and send a summary to the product team."

Retrieve feedback (RAG) → Analyze with LLM → Send email (Tool)
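Using the `RAGPipeline` and `ToolExecutor` sketched earlier, the combination stays small; roughly:

code
def summarize_and_notify(pipeline, executor, topic: str) -> dict:
    """RAG for the reading half, a tool call for the acting half.
    Keep the boundary explicit: retrieval failures should stop the email."""
    rag_result = pipeline.run(f"Summarize recent customer feedback about {topic}")
    if rag_result.get("error"):
        return rag_result  # don't email a hallucinated summary

    return executor.execute(
        f"Send an email to the product team with subject "
        f"'Customer feedback: {topic}' and this body:\n{rag_result['answer']}"
    )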

Tools + MCP works for platforms with dynamic capabilities and complex operations. Example: "Connect to the user's data warehouse and run analytics queries."

Discover available data sources (MCP) → Execute specific queries (Tools)

RAG + Tools + MCP is almost always overengineered. If you're using all three, you probably haven't clearly defined your problem boundaries.

Most production systems should pick one primary pattern and use others sparingly for edge cases.

Summary & Next Steps

The decision tree is simpler than it appears:

If your problem is "answer questions using specific documents," use RAG.

If your problem is "execute operations based on user intent," use Tools.

If your problem is "work with capabilities that vary per user or context," use MCP.

Everything else is optimization details.

Start with the simplest pattern that might work. RAG is easiest to implement. Tools add complexity but give you more control. MCP is the most complex but handles dynamic environments best.

Don't build for scale you don't have. A system handling 10 queries per hour doesn't need the architecture of a system handling 10,000.

Measure what matters: accuracy, latency, cost per query, error rates. Optimize the bottleneck, not the architecture diagram.

Next steps: Pick one pattern. Build the minimal implementation. Put it in production. Measure failure modes. Only then should you consider combining patterns or switching approaches.

The best LLM system is the one that ships and works, not the one with the cleverest architecture.