An LLM that can only generate text is a fancy autocomplete engine. The moment you give it tools — the ability to call functions, query APIs, execute code, read databases — you get something categorically different. You get an agent.
Tool use is the architectural boundary between "chatbot" and "agent." It's also where most production systems break down. Not because tool calling is hard to implement. Because the ecosystem has exploded into four or five distinct patterns, each with different trade-offs, and most teams pick one without understanding why.
This article is a map. Local tools, API-based tools, plugin tools, MCP, stateful tools — what they are, when they make sense, and where they'll burn you. We'll use LangChain throughout, since that's where most of this runs in practice. And we'll cover tool configuration and error handling, because that's where the bugs live.
```mermaid
graph LR
A([Agent]) --> B[Local Tools\nfast · in-process · no isolation]
A --> C[API-Based Tools\nnetwork · shared · fault-isolated]
A --> D[Plugin Tools\nthird-party · provider-managed]
A --> E[MCP Servers\nprotocol-standard · cross-framework]
A --> F[Stateful Tools\npersistent session · high risk]
style A fill:#4A90E2,color:#fff,stroke:#2c6fad
style B fill:#6BCF7F,color:#fff,stroke:#4aad61
style C fill:#6BCF7F,color:#fff,stroke:#4aad61
style D fill:#6BCF7F,color:#fff,stroke:#4aad61
style E fill:#6BCF7F,color:#fff,stroke:#4aad61
style F fill:#FFA07A,color:#fff,stroke:#cc6040
```
Each of these is a different answer to the same question: where does tool logic live, and who owns it? The right answer depends on your scale, your team structure, and your tolerance for operational complexity.
How LangChain Structures LLM Interactions
Before tools make sense, you need to understand how LangChain models the conversation itself.
Every interaction with an LLM is a sequence of messages. LangChain formalizes this as a list passed to the model, and there are two main types you'll work with constantly:
- HumanMessage — input from the user or orchestration layer
- AIMessage — the model's response, which may include text, tool call requests, or both
There are others — most importantly ToolMessage, which carries the result of a tool execution back into the context. (FunctionMessage served a similar role in older LangChain versions but is deprecated; use ToolMessage in current code.) These two message types are your core loop.
```python
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

messages = [
    SystemMessage(content="You are a helpful assistant."),
    HumanMessage(content="What's the weather in Mumbai?"),
]
```
When a model decides to call a tool, the AIMessage it returns doesn't contain the final answer — it contains a tool_calls field describing which tool to invoke and with what arguments. Your application is responsible for executing that tool and appending a ToolMessage with the result. Then the model gets the full context and generates its final response.
This message-passing loop is how LangGraph manages agent state. If you want the full picture of how this fits into a stateful graph, From LLMs to Agents: The Mindset Shift Nobody Talks About covers that transition in depth.
Binding and Invoking Tools
Tools get attached to a model using .bind_tools(). This tells the model what functions are available, what their inputs look like, and how to call them. The model doesn't execute tools — it requests them. Your code executes them.
```python
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a given city."""
    return f"The weather in {city} is 32°C and humid."

model = ChatOpenAI(model="gpt-4o")
model_with_tools = model.bind_tools([get_weather])

response = model_with_tools.invoke([HumanMessage(content="Weather in Mumbai?")])
print(response.tool_calls)
# [{'name': 'get_weather', 'args': {'city': 'Mumbai'}, 'id': 'call_abc123'}]
```
The @tool decorator does three things: it registers the function as a tool, extracts the schema from the type annotations, and uses the docstring as the tool's description. That description is what the model reads to decide whether and when to call the tool. Write it carefully.
Local Tools
Local tools are functions that run in the same process as your agent. No network call, no API key, no external dependency — just Python.
They're the default starting point for most agentic systems, and they cover a surprisingly wide range of use cases: data transformation, calculations, string processing, in-memory state manipulation, calling internal modules, wrapping SDK clients that are already initialized.
What a Local Tool Looks Like
```python
from langchain_core.tools import tool
from pydantic import BaseModel, Field
import json

class SearchInput(BaseModel):
    query: str = Field(description="The search query to look up")
    max_results: int = Field(default=5, description="Maximum number of results to return")

@tool(args_schema=SearchInput)
def search_knowledge_base(query: str, max_results: int = 5) -> str:
    """
    Search the internal knowledge base for documents matching the query.
    Returns a JSON list of matching document titles and snippets.
    """
    # Actual implementation would hit a vector store or search index
    results = [{"title": f"Doc about {query}", "snippet": "..."}]
    return json.dumps(results[:max_results])
```
A few things worth noting about this pattern:
The args_schema gives you explicit Pydantic validation on inputs before anything executes. If the model hallucinates a wrong type, Pydantic catches it before your function does.
The docstring is part of the tool's metadata sent to the model. It's not documentation for you — it's an instruction to the LLM. Be specific about what the tool returns, not just what it does.
The return type should be a string. Models read tool results as text. If you return a dict or list, LangChain will coerce it, but you're better off controlling the serialization explicitly.
Schema Is the Interface
A local tool's schema is its primary interface with the model. That schema includes:
- Name
- Description (from docstring)
- Input parameter names, types, and field descriptions
- Which parameters are required vs. optional
```python
print(search_knowledge_base.name)                  # "search_knowledge_base"
print(search_knowledge_base.description)           # The docstring
print(search_knowledge_base.args_schema.schema())  # Full JSON schema
```
This metadata is what the model uses to decide: should I call this tool, and if so, with what arguments? Poor descriptions lead to wrong calls. Vague parameter names lead to hallucinated inputs. This is one of the most common failure modes in agentic systems — not the model being dumb, but the tool schema being unclear.
Where Local Tools Break Down
Local tools are fine for a single agent on one machine. In production, you'll hit these walls fast:
Scalability. Every agent instance loads every tool into memory. In a multi-agent setup with dozens of specialized tools, you can't selectively deploy tools to specific instances without restructuring your codebase.
Shared logic becomes a liability. Two agents needing the same tool means either a shared module (tight coupling) or copied implementation (divergence). When business logic changes, there's no central versioning, no staged rollout, no rollback — just redeploys.
No fault boundary. A bug in a local tool can take down the entire agent process. There's nothing separating tool execution from the orchestration layer.
Local tools are the right starting point. They're not the right ending point for anything that needs to run reliably at scale.
API-Based Tools
The straightforward solution to local tool limitations: move tool execution behind an HTTP boundary. Your agent calls a service; the service runs the tool; the service returns a result.
This is the same decomposition you'd apply to any distributed system. Tools become microservices. Or more commonly, they become thin wrappers around existing internal APIs your organization already maintains.
What an API Tool Looks Like
```python
from langchain_core.tools import tool
import httpx

@tool
def get_stock_price(ticker: str) -> str:
    """
    Get the current stock price for a given ticker symbol.
    Returns price in USD as a string.
    """
    response = httpx.get(
        f"https://api.internal.company.com/stocks/{ticker}",
        headers={"Authorization": "Bearer ..."},
        timeout=5.0,
    )
    response.raise_for_status()
    data = response.json()
    return f"{ticker}: ${data['price']:.2f}"
```
From LangChain's perspective, this is still just a tool — same @tool decorator, same schema, same invocation pattern. The implementation difference is the entire point: you get independent deployability, horizontal scaling, fault isolation (a failing service returns an error rather than crashing the agent), and a single shared implementation that every agent calls.
The trade-off is latency and operational complexity. A local tool executes in under 1ms for pure computation, or 5–50ms if it's hitting an already-initialized SDK client. Move it behind HTTP and you're looking at 50–200ms for a co-located internal service, 200–500ms for a third-party API, and 500ms+ once you add auth token refresh, retries, and cold starts. In a 10-step ReAct loop, that delta between local and API is 5 seconds of wall-clock time the user is waiting. You need auth, observability, timeouts, and circuit breakers on both ends.
For most production systems, the right model is a mix: fast utility functions as local tools, shared business logic behind APIs. The 7 GenAI Architectures article covers how this plays out in practice.
Plugin Tools
Plugin tools take the API model further: instead of building your own tool APIs, you consume third-party capabilities that model providers have pre-integrated.
The idea originated with OpenAI's ChatGPT plugin system. The model knows about a catalog of available plugins, can reason about which ones to call, and the provider handles the protocol details. It's tool use as a marketplace.
How Plugin Tools Are Built
Most plugin tools are generated from OpenAPI specs. If a service has an OpenAPI/Swagger schema, you can auto-generate LangChain tools from it without writing a single tool wrapper by hand:
```python
from langchain_community.agent_toolkits.openapi import planner
from langchain_community.agent_toolkits.openapi.spec import reduce_openapi_spec
from langchain_community.utilities.requests import RequestsWrapper
import yaml, httpx

# Fetch the OpenAPI spec
raw_spec = yaml.safe_load(httpx.get("https://api.example.com/openapi.yaml").text)
spec = reduce_openapi_spec(raw_spec)

# Build tools from every operation in the spec
requests_wrapper = RequestsWrapper(headers={"Authorization": "Bearer ..."})
agent = planner.create_openapi_agent(
    spec,
    requests_wrapper,
    llm=model,
    verbose=True,
)
```
reduce_openapi_spec trims the spec down to what the LLM can fit in context. Each API endpoint becomes a callable tool. The model reads the spec's description fields to decide which endpoints to call and how to construct parameters.
This is the right pattern when you're integrating a third-party API that already has a well-documented OpenAPI spec: Stripe, Twilio, Notion, HubSpot. You get a full set of tools in minutes instead of writing wrappers for every endpoint.
The schema normalization problem. Different providers describe tool schemas differently. OpenAI uses one JSON schema format. Gemini uses another. Anthropic's tool input schema has its own structure. LangChain normalizes across these at the framework level — bind_tools() handles the provider-specific serialization — but if you're calling provider APIs directly, you're responsible for the translation:
```python
# Anthropic tool schema
anthropic_tool = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"],
    },
}

# OpenAI tool schema — same tool, different envelope
openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"],
        },
    },
}
```
Same tool, two different schemas. LangChain's @tool decorator generates one canonical definition and handles the conversion. This is one of the concrete reasons to use a framework rather than raw API calls — you write the tool once, deploy it against any provider.
The Ecosystem Today
Gemini has the most mature built-in plugins: Google Workspace, Search grounding, and code execution as first-class capabilities, not third-party wrappers. Anthropic skipped the plugin marketplace entirely — their ecosystem bet is MCP (next section). Microsoft's Copilot stack leans into the Graph API (Office 365, Teams, Azure); strong inside that ecosystem, irrelevant outside it.
On the open-source side: LlamaIndex tool specs, Haystack components, and LangChain Hub cover common integrations (Slack, Notion, GitHub, Jira) well enough that you rarely need to write those wrappers yourself.
The hard limit across all of them: vendor lock-in. Tools built for one provider's format don't port. That's the problem MCP was built to solve.
Model Context Protocol (MCP)
MCP is a protocol specification, not a framework. It defines a standard way for LLM applications to communicate with tool servers, so that any MCP-compatible client can use any MCP-compatible server — regardless of which model or framework you're using.
The analogy that holds: MCP is to agent tools what USB-C is to device charging. Before USB-C, every device had its own cable. After USB-C, the interface is standardized and the hardware is interchangeable.
Here's the architecture at a glance before we get into the protocol details:
```mermaid
graph TD
    LLM[LLM]
    Agent[LangGraph Agent]
    Client[MCP Client<br/>langchain-mcp-adapters]
    S1[MCP Server<br/>GitHub]
    S2[MCP Server<br/>Filesystem]
    S3[MCP Server<br/>PostgreSQL]
    S4[MCP Server<br/>Custom API]
    LLM <-->|tool_calls / ToolMessage| Agent
    Agent <-->|session management| Client
    Client <-->|JSON-RPC over stdio| S1
    Client <-->|JSON-RPC over stdio| S2
    Client <-->|JSON-RPC over HTTP/SSE| S3
    Client <-->|JSON-RPC over HTTP/SSE| S4
    style LLM fill:#4A90E2,color:#fff,stroke:#2c6fad
    style Agent fill:#7B68EE,color:#fff,stroke:#5a4ecc
    style Client fill:#7B68EE,color:#fff,stroke:#5a4ecc
    style S1 fill:#6BCF7F,color:#fff,stroke:#4aad61
    style S2 fill:#6BCF7F,color:#fff,stroke:#4aad61
    style S3 fill:#6BCF7F,color:#fff,stroke:#4aad61
    style S4 fill:#6BCF7F,color:#fff,stroke:#4aad61
```
The key insight: the agent doesn't know or care what's behind each MCP server. The client discovers capabilities at runtime via tools/list. Add a new server, remove an old one — the agent adapts without a redeploy.
The Protocol
MCP defines two sides:
MCP Server — exposes capabilities: tools, resources (files, database records, API responses), and prompts. A server might expose a search_web tool, a read_file resource, and a summarize_document prompt. Servers can be local processes (stdio transport) or remote services (HTTP/SSE transport). They're language-agnostic — the spec is JSON-RPC over stdio or HTTP, so you can implement a server in Python, Go, TypeScript, or anything else.
MCP Client — the agent or orchestration layer that connects to servers, discovers their capabilities, and routes tool calls to the appropriate server. A client maintains a session with one or more servers and handles the protocol handshake, capability negotiation, and message routing.
The core message types:
- tools/list — client asks server what tools it exposes
- tools/call — client invokes a specific tool with arguments
- resources/list, resources/read — client reads resources
- sampling/createMessage — server can request LLM completions from the client (bidirectional)
That last one is interesting — servers can call back into the LLM. This enables genuinely recursive, multi-model workflows.
```mermaid
sequenceDiagram
    participant User
    participant Agent as LangGraph Agent<br/>(MCP Client)
    participant LLM as LLM
    participant Adapter as LangChain<br/>MCP Adapter
    participant S1 as MCP Server<br/>(Filesystem)
    participant S2 as MCP Server<br/>(GitHub)
    User->>Agent: User message
    Agent->>LLM: messages + bound tools
    LLM-->>Agent: AIMessage (tool_calls: [read_file])
    Agent->>Adapter: route tool call → read_file
    Adapter->>S1: tools/call { name: "read_file", args: {...} }
    S1-->>Adapter: ToolResult { content: "..." }
    Adapter-->>Agent: ToolMessage
    Agent->>LLM: messages + ToolMessage
    LLM-->>Agent: AIMessage (tool_calls: [search_repo])
    Agent->>Adapter: route tool call → search_repo
    Adapter->>S2: tools/call { name: "search_repo", args: {...} }
    S2-->>Adapter: ToolResult { content: "..." }
    Adapter-->>Agent: ToolMessage
    Note over S2,Agent: sampling/createMessage<br/>(server calls back into LLM)
    S2->>Agent: sampling/createMessage { prompt: "..." }
    Agent->>LLM: createMessage request
    LLM-->>Agent: completion
    Agent-->>S2: sampling result
    Agent->>LLM: full message history
    LLM-->>Agent: AIMessage (final answer)
    Agent-->>User: Response
```
LangChain + MCP
LangChain has first-class MCP support through langchain-mcp-adapters:
```python
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from langchain_mcp_adapters.tools import load_mcp_tools
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Connect to a local MCP server
server_params = StdioServerParameters(
    command="uvx",
    args=["mcp-server-filesystem", "/tmp/workspace"],
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()

        # Load all tools exposed by this server
        tools = await load_mcp_tools(session)

        model = ChatOpenAI(model="gpt-4o")
        agent = create_react_agent(model, tools)

        result = await agent.ainvoke({
            "messages": [{"role": "user", "content": "List all Python files in the workspace"}]
        })
```
load_mcp_tools discovers the server's tool catalog and wraps each tool in a LangChain-compatible BaseTool. From that point, LangGraph doesn't care that these tools come from MCP — they look identical to local tools from the graph's perspective.
You can connect to multiple MCP servers simultaneously:
```python
from langchain_mcp_adapters.client import MultiServerMCPClient

async with MultiServerMCPClient(
    {
        "filesystem": {"command": "uvx", "args": ["mcp-server-filesystem", "/workspace"]},
        "github": {"url": "https://mcp.github.com", "transport": "streamable_http"},
        "postgres": {"command": "uvx", "args": ["mcp-server-postgres", DATABASE_URL]},
    }
) as client:
    tools = client.get_tools()
    agent = create_react_agent(model, tools)
```
Each server is isolated. Tools from the filesystem server and tools from the GitHub server are all available to the agent with no additional integration work.
What MCP Gets Right
Interoperability. A tool server written once works with any MCP client — Claude, GPT-4, Llama, your custom orchestration layer. The ecosystem compounds.
Separation of concerns. Tool implementation is completely decoupled from agent implementation. Teams can own their MCP servers independently. Tool servers can be updated, versioned, and deployed without touching agent code.
Discovery. The client doesn't need to know tool schemas in advance. It connects to a server, calls tools/list, and gets the full catalog dynamically. This enables genuinely dynamic tool loading based on context.
Bidirectionality. Servers can request LLM completions from the client. This supports patterns like tool-level reasoning, embedded summarization, and multi-model chains that are awkward to implement with static tool definitions.
What MCP Doesn't Solve Yet
Be honest about the gaps:
Authentication and authorization are not standardized. The MCP spec doesn't define how servers authenticate clients or how fine-grained permissions work. Every server implements its own auth. There's no standard for capability tokens, scoped credentials, or revocation. If you're building security-critical systems, you need to layer auth on top — the protocol won't do it for you. This connects directly to the patterns in Credential Scoping for Agents.
Tool discovery at scale is unsolved. If your organization has 200 MCP servers, how does an agent know which ones to connect to? Static configuration doesn't scale. Dynamic discovery registries don't exist in a standardized form yet.
Error semantics are underspecified. Tool failures return JSON-RPC errors, but the protocol doesn't define retry semantics, partial failure handling, or how clients should reason about transient vs. permanent failures.
Server health and observability. There's no standard for health checks, metrics, or distributed tracing across MCP server boundaries. You're stitching together your own observability story. The patterns from Agentic AI Observability apply here, but you'll need to instrument MCP calls explicitly.
MCP is early but directionally correct, and it's well positioned to become the standard. Building MCP-compatible tool servers today means you're not locked into any one framework tomorrow.
Stateful Tools
Most tools are stateless: input in, output out, nothing persists. Stateful tools break that model. They maintain state across calls — session objects, database connections, in-memory caches, workflow state machines.
Why You'd Want Them
Some tool interactions are inherently multi-step. A browser automation tool needs to maintain a session across page loads. A database tool might manage a transaction across multiple queries. A file editing tool might stage changes before committing. Implementing these as multiple stateless calls creates coordination overhead and race conditions. A stateful tool wraps the complexity.
```python
from langchain_core.tools import BaseTool
from typing import Optional
import psycopg2

class PostgresSessionTool(BaseTool):
    name: str = "postgres_query"
    description: str = """
    Execute SQL queries within a persistent database session.
    Maintains transaction state across multiple calls.
    Always call commit_transaction or rollback_transaction when done.
    """
    _connection: Optional[psycopg2.extensions.connection] = None
    _cursor: Optional[psycopg2.extensions.cursor] = None

    def _run(self, query: str, operation: str = "query") -> str:
        if operation == "connect":
            self._connection = psycopg2.connect(DATABASE_URL)
            self._cursor = self._connection.cursor()
            return "Connected to database"
        if operation == "query":
            self._cursor.execute(query)
            return str(self._cursor.fetchall())
        if operation == "commit":
            self._connection.commit()
            return "Transaction committed"
        if operation == "rollback":
            self._connection.rollback()
            return "Transaction rolled back"
        raise ValueError(f"Unknown operation: {operation}")
```
The Security Problems
Stateful tools are where the most dangerous security bugs live.
State leakage across sessions. If a stateful tool is shared across multiple concurrent agent runs (which is the natural implementation in a server), one agent's state can bleed into another. A browser session, an open file handle, a pending transaction — all can be read or modified by the wrong agent if you're not careful about isolation.
Persistent side effects from partial failures. A stateless tool that fails leaves nothing behind. A stateful tool that fails mid-workflow can leave state partially applied: a transaction open, a file locked, a session authenticated. The agent might retry from the beginning while the previous call's state is still live.
Expanded blast radius. A compromised stateful tool has access to everything in its session. A database connection tool that's been manipulated via prompt injection can execute arbitrary queries against an open connection that already has elevated permissions.
How to mitigate:
First, never share stateful tool instances across agent invocations. Create a fresh instance per agent run. This is the same principle as not sharing database connection pools across untrusted tenants.
Second, implement explicit lifecycle management and enforce it. Tools should have initialize() and cleanup() methods, and your orchestration layer should call cleanup() regardless of whether the agent succeeded or failed. LangGraph's node lifecycle hooks are useful here.
Third, scope credentials to the minimum required for each operation. An agent that needs to read from a database shouldn't be holding a connection with write permissions. This applies at every layer — the Tool Execution Firewall pattern is directly relevant when tools have persistent capabilities.
Fourth, log every state transition. Stateful tools are where audit trails become critical. You need to reconstruct exactly what state the tool was in at every point in the agent's execution. This isn't optional for anything running in production. See Agent Audit Trails: Logging Context, Not Just Actions.
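The first, second, and fourth mitigations compose naturally. A minimal sketch of per-run lifecycle enforcement; `StatefulTool` and `tool_session` are illustrative names, not framework APIs:

```python
from contextlib import contextmanager

class StatefulTool:
    """Hypothetical stateful tool with an explicit lifecycle."""
    def __init__(self):
        self.active = False
        self.log = []  # audit trail of state transitions

    def initialize(self):
        self.active = True
        self.log.append("initialize")

    def cleanup(self):
        self.active = False
        self.log.append("cleanup")

@contextmanager
def tool_session():
    """One fresh tool instance per agent run; cleanup always fires."""
    t = StatefulTool()
    t.initialize()
    try:
        yield t
    finally:
        t.cleanup()  # runs even if the agent run raised

# Usage: state never outlives the run, even on failure
with tool_session() as t:
    t.log.append("query")
```

The same shape maps onto LangGraph node lifecycle hooks: instance creation on entry, guaranteed cleanup on exit, and the `log` list stands in for whatever audit sink you actually use.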
Automated Tool Development
As tool catalogs grow, the engineering cost of maintaining them grows with it. One emerging pattern: use the model to generate tools, not just call them.
Foundation Models as Tool Makers
Frontier models are good at generating well-formed tool definitions from a natural language description or an existing schema. The practical use case is wrapping APIs you don't control — given an OpenAPI spec or a plain English description of what a service does, a model can produce a correctly typed, docstring-annotated Python function ready to bind.
Here's a concrete pattern: you have an internal service with a spec, and you want a tool without writing the wrapper yourself.
```python
from langchain_openai import ChatOpenAI
import json

llm = ChatOpenAI(model="gpt-4o")

def generate_tool_from_description(service_name: str, description: str, example_response: dict) -> str:
    """Ask the LLM to produce a @tool-decorated Python function."""
    prompt = f"""Write a LangChain @tool function for this internal service.

Service: {service_name}
Description: {description}
Example response shape: {json.dumps(example_response, indent=2)}

Requirements:
- Use @tool decorator from langchain_core.tools
- Type-annotated parameters
- Docstring that describes what the tool returns, not just what it does
- Return type must be str
- Call httpx.get with a placeholder URL
- Include basic error handling

Return only the Python function, no explanation."""
    response = llm.invoke(prompt)
    return response.content

# Usage
generated_code = generate_tool_from_description(
    service_name="inventory-service",
    description="Returns current stock levels for a given SKU",
    example_response={"sku": "ABC-123", "quantity": 42, "warehouse": "MUM-01"},
)
print(generated_code)
```
The output is a starting point, not a finished tool. You still need to: review the generated docstring (it determines how the model calls the tool), replace placeholder URLs with real endpoints, and validate the schema with tool.args_schema.schema() before binding. The generation removes the boilerplate; the review gate removes the risk.
Don't skip the review step. A generated tool with a vague docstring will be called incorrectly by the same model that generated it.
Real-Time Code Generation
The more aggressive pattern: agents that write code, test it, and execute it in a loop — the "code interpreter" approach. The agent doesn't generate a reusable tool definition; it generates a one-shot script to answer a specific question.
````python
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
import subprocess
import sys

llm = ChatOpenAI(model="gpt-4o")

@tool
def generate_and_execute(task: str, data_context: str) -> str:
    """
    Generate and execute a Python script to complete a data analysis task.
    Provide the task description and any relevant data context.
    Returns execution output or a descriptive error.
    """
    code_prompt = f"""Write a Python script to: {task}

Available context: {data_context}

Rules:
- Use only stdlib + pandas + numpy
- Print results to stdout
- Handle exceptions and print errors clearly
- No file I/O, no network calls"""

    # Intentional nested LLM call: this tool itself invokes the LLM to generate code.
    # The outer agent decides *when* to use this tool; the inner call handles *what* to generate.
    # Two-level pattern — not a mistake, but document it clearly for anyone reading the code.
    code_response = llm.invoke(code_prompt)
    code = code_response.content.strip().removeprefix("```python").removesuffix("```").strip()

    # In production: run this in a sandboxed subprocess, not exec()
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=10,
    )
    if result.returncode != 0:
        return f"Execution failed:\n{result.stderr[:500]}"
    return result.stdout[:1000]
````
Note the subprocess over exec() — it gives you process isolation, timeout enforcement, and stderr capture. PythonREPLTool from langchain_experimental runs code in the host process with full permissions. That's acceptable locally; in production you want a container boundary with network isolation and no credential access.
The pattern works well for: data analysis over known schemas, metric calculations, format transformations. It breaks for: anything requiring persistent state, network access to internal services, or tasks where the generated code is likely to be wrong in non-obvious ways that need human judgment to catch.
Tool Use Configuration
The tool_choice Parameter
When you bind tools to a model, you can control how aggressively the model uses them via tool_choice:
```python
# Auto: model decides when to use tools (default)
model.bind_tools(tools, tool_choice="auto")

# Required: model MUST call at least one tool
model.bind_tools(tools, tool_choice="required")

# Any: alias for required in some providers
model.bind_tools(tools, tool_choice="any")

# Specific: force a specific tool call
model.bind_tools(tools, tool_choice={"type": "function", "function": {"name": "get_weather"}})
```
auto is the default and the right choice for most agentic workflows. The model reasons about whether tools are needed.
required / any forces a tool call on every invocation. Use this when you know the model should always take an action — a router that must classify every input, a data extraction pipeline where every call should produce structured output, a workflow where proceeding without tool use is always wrong. The risk: the model will call something, even if no tool is appropriate. In edge cases, it'll call the closest thing it can find with nonsense arguments. Test your edge cases before deploying this mode.
Specific tool forcing is for structured extraction and workflow checkpoints where you need a guaranteed schema on every call. Combine with a Pydantic output parser for reliable structured output.
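The guarantee you get from forcing a tool is a schema contract you can validate mechanically. A provider-agnostic sketch of that idea, with `Invoice` and `extract_invoice` as hypothetical names:

```python
from pydantic import BaseModel, Field, ValidationError

class Invoice(BaseModel):
    """The forced tool's args schema doubles as your output schema."""
    vendor: str = Field(description="Vendor name")
    total: float = Field(description="Total amount in USD")

# The shape bind_tools(tools, tool_choice=...) serializes to for OpenAI-style APIs
forced_choice = {"type": "function", "function": {"name": "extract_invoice"}}

def parse_forced_call(args: dict) -> Invoice:
    """Validate the forced tool call's arguments against the schema."""
    return Invoice(**args)

# A well-formed forced call parses cleanly...
inv = parse_forced_call({"vendor": "Acme Corp", "total": 1234.5})

# ...and a malformed one fails loudly instead of flowing downstream
try:
    parse_forced_call({"vendor": "Acme Corp", "total": "a lot"})
except ValidationError:
    pass
```

Because the model must emit arguments for `extract_invoice` on every call, every response either parses into a valid `Invoice` or raises at a single, well-defined boundary.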
The Too Many Tools Problem
There's a limit to how many tools you can bind to an agent before reasoning quality degrades. In practice, this threshold sits somewhere around 20–30 tools depending on the model. Beyond that, models start making wrong tool selections — choosing a vaguely relevant tool over the correct one, or failing to call any tool when one is clearly needed. The problem is attention: every tool's schema consumes context, and when the list is long, the signal-to-noise ratio in the tool selection step drops.
The production fix is dynamic tool selection: don't bind all tools upfront. Select the relevant subset for each request at runtime.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# All available tools, with their descriptions as the retrieval corpus
all_tools = [get_customer_profile, update_ticket_status, search_knowledge_base,
             calculate_sla_deadline, get_product_details, check_inventory, ...]

# Build a vector index over tool descriptions at startup
tool_descriptions = [
    {"name": t.name, "description": t.description} for t in all_tools
]
tool_index = FAISS.from_texts(
    texts=[t["description"] for t in tool_descriptions],
    embedding=OpenAIEmbeddings(),
    metadatas=tool_descriptions,
)
tool_lookup = {t.name: t for t in all_tools}

def select_tools_for_request(user_query: str, k: int = 8) -> list:
    """Retrieve the k most relevant tools for this query."""
    results = tool_index.similarity_search(user_query, k=k)
    return [tool_lookup[r.metadata["name"]] for r in results if r.metadata["name"] in tool_lookup]

# Per-request: bind only the relevant tools
def run_agent(user_query: str):
    relevant_tools = select_tools_for_request(user_query, k=8)
    model = ChatOpenAI(model="gpt-4o").bind_tools(relevant_tools)
    # ... rest of agent invocation
```
This keeps the bound tool list to 8–10 per request regardless of catalog size. The embedding index adds ~5ms at query time — a worthwhile trade for consistent reasoning quality across a large tool catalog.
One important detail: tool descriptions need to be retrieval-optimized, not just model-optimized. A description written to tell the LLM how to use a tool may not match, at retrieval time, the user queries that should trigger it. Write two descriptions if needed — one for retrieval, one for the model's schema — or use a separate retrieval_hint field during indexing.
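A dependency-free sketch of the two-description idea. The metadata layout and the `retrieval_hint` contents are hypothetical — the point is that the text you embed for retrieval can differ from the description the model sees:

```python
# Hypothetical metadata layout: one text for the model's schema,
# one phrased like the user queries that should trigger the tool.
tools_meta = [
    {
        "name": "calculate_sla_deadline",
        # What the LLM sees in the tool schema:
        "description": "Compute the SLA deadline from ticket priority and creation time.",
        # What user queries actually sound like — index THIS for retrieval:
        "retrieval_hint": "when is this ticket due, how long until breach, "
                          "deadline for responding, SLA countdown",
    },
]

# At index-build time, embed the hint if present, else fall back to the description.
texts_to_index = [m.get("retrieval_hint", m["description"]) for m in tools_meta]
```

The model still receives only `description` in its tool schema; `retrieval_hint` exists purely to improve similarity search.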
When Things Break: Error Handling
Tool call failures in production follow predictable patterns. Here's the handling stack, in order:
1. Validate before the model sees it
Schema validation catches malformed inputs before they reach your tool's implementation. Define strict Pydantic schemas and let validation fail fast:
```python
from pydantic import BaseModel, Field
from langchain_core.tools import tool

class QueryInput(BaseModel):
    query: str = Field(min_length=1, max_length=1000)
    filters: dict[str, str] = Field(default_factory=dict)
    limit: int = Field(default=10, ge=1, le=100)

@tool(args_schema=QueryInput)
def search_records(query: str, filters: dict, limit: int) -> str:
    """Search records matching query with optional filters."""
    ...
```
If the model sends limit: "ten" instead of limit: 10, Pydantic rejects it before your code runs. The error message goes back to the model, which usually self-corrects on the next attempt.
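You can see the rejection in isolation without any LLM in the loop. A small demo against the same schema (the `feedback` string is a hypothetical stand-in for what you'd put in the ToolMessage):

```python
from pydantic import BaseModel, Field, ValidationError

class QueryInput(BaseModel):
    query: str = Field(min_length=1, max_length=1000)
    limit: int = Field(default=10, ge=1, le=100)

# A model-generated call with limit="ten" fails validation before any
# tool code runs; the error text is what gets fed back to the model.
try:
    QueryInput(query="open tickets", limit="ten")
    feedback = None
except ValidationError as err:
    feedback = f"Error in 'limit': {err.errors()[0]['msg']}"
```

Note that Pydantic v2's default (lax) mode still coerces `limit: "10"` to the integer 10 — only genuinely unparseable values like `"ten"` are rejected.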
2. Retry with exponential backoff
Transient failures — network timeouts, rate limits, temporary service unavailability — need retry logic. Don't implement this inside the tool; implement it at the tool invocation layer so it applies uniformly:
```python
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from langchain_core.tools import tool
import httpx

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.HTTPStatusError)),
)
def _call_api_with_retry(url: str, payload: dict) -> dict:
    response = httpx.post(url, json=payload, timeout=10.0)
    response.raise_for_status()
    return response.json()

@tool
def my_api_tool(input: str) -> str:
    """Call the internal API."""
    result = _call_api_with_retry("https://api.internal.com/endpoint", {"input": input})
    return result["output"]
```
2a. Retry with feedback (LLM self-correction)
Exponential backoff handles infrastructure failures. A different failure mode needs a different fix: the model called the tool with wrong arguments. The tool ran, returned an error, and the model needs to see that error in context to correct itself. This is the retry-with-feedback loop.
LangChain's ToolMessage is the mechanism. When a tool call fails, append a ToolMessage with the error — the model reads it and retries with corrected arguments, usually successfully:
```python
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o").bind_tools(tools)

def run_with_tool_feedback(user_query: str, max_retries: int = 3) -> str:
    messages = [HumanMessage(content=user_query)]
    for attempt in range(max_retries):
        response = model.invoke(messages)
        messages.append(response)
        if not response.tool_calls:
            # No tool call — final text response
            return response.content
        # Execute each tool call; catch failures and feed them back
        for tc in response.tool_calls:
            try:
                tool_fn = tool_lookup[tc["name"]]
                result = tool_fn.invoke(tc["args"])
                messages.append(ToolMessage(
                    content=result,
                    tool_call_id=tc["id"],
                ))
            except Exception as e:
                # Feed the error back — model will see it and can correct
                messages.append(ToolMessage(
                    content=f"Error: {str(e)}. Check your arguments and try again.",
                    tool_call_id=tc["id"],
                ))
        # Loop again: the model reads the ToolMessages and either answers
        # or retries with corrected arguments
    return "Max retries reached. Could not complete the request."
```
The key: don't silently swallow tool errors or raise immediately. Return a descriptive ToolMessage with the error and let the model attempt self-correction. GPT-4o and Claude 3.5+ handle this well — they'll adjust argument types, fix field names, or switch to a different tool. Cap retries at 2–3; if the model can't correct after that, the problem is usually the tool schema, not the model.
3. Fall back gracefully
Not every tool failure should abort the agent. Define what the fallback behavior is for each tool and encode it explicitly:
```python
@tool
def get_real_time_price(ticker: str) -> str:
    """Get real-time stock price. Falls back to last known price if unavailable."""
    try:
        return fetch_live_price(ticker)
    except PriceServiceUnavailable:
        cached = get_cached_price(ticker)
        if cached:
            return f"{ticker}: ${cached['price']:.2f} (cached {cached['age_minutes']}m ago)"
        return f"Price unavailable for {ticker}. Please try again later."
```
Return a meaningful string the model can reason about. Don't raise an exception unless the failure is genuinely unrecoverable and you want the agent to stop.
4. Log everything
Every tool call — its inputs, its outputs, its latency, its success or failure — should be logged with the agent's run ID. This is what lets you reconstruct exactly what happened when something goes wrong in production:
```python
import structlog
import time
from functools import wraps

logger = structlog.get_logger()

def logged_tool(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        tool_name = func.__name__
        try:
            result = func(*args, **kwargs)
            logger.info("tool.success",
                tool=tool_name,
                args=kwargs,
                latency_ms=(time.monotonic() - start) * 1000,
            )
            return result
        except Exception as e:
            logger.error("tool.failure",
                tool=tool_name,
                args=kwargs,
                error=str(e),
                latency_ms=(time.monotonic() - start) * 1000,
            )
            raise
    return wrapper

# Order matters: @logged_tool wraps the raw function first, then @tool
# converts the wrapped function into a LangChain tool. @wraps preserves
# the name and docstring that @tool reads to build the schema.
@tool
@logged_tool
def get_customer_profile(customer_id: str) -> str:
    """Fetch customer profile from CRM by customer ID."""
    ...
```
The MCP vs RAG vs Tools article goes into when tool use is the right answer versus retrieval or direct reasoning — worth reading alongside this if you're making architectural decisions.
Comparison: Choosing Your Tool Pattern
Before the decision framework, here's the full picture side by side:
| Tool Type | Latency | Fault Isolation | Scalability | Operational Complexity | Best For |
|---|---|---|---|---|---|
| Local | Very low | None — tool crash = agent crash | Poor | Low | Computation, formatting, in-process logic |
| API-based | Medium | High — service boundary | Good | Medium | Shared business logic, centralized impl |
| Plugin | Medium | Provider-dependent | Medium | Low | Third-party commoditized capabilities |
| MCP | Medium | High — server boundary | Excellent | Medium | Multi-framework ecosystems, tool distribution |
| Stateful | Variable | Risky — shared state leaks | Hard | High | Multi-step processes requiring persistent session |
"Operational complexity" here means the surface area you're responsible for — not the difficulty of the initial implementation. A local tool is trivial to add and painful to maintain at scale. An MCP server requires more upfront work but gives you independent deployability, versioning, and the ability to reuse across any MCP-compatible agent.
The latency column deserves a note: all networked tool types (API, Plugin, MCP) add a round trip. In a multi-step ReAct loop where an agent makes 8–12 tool calls, a 100ms tool latency difference compounds into 1–1.2 seconds. Profile your tool latency early, especially for latency-sensitive user-facing workflows.
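The compounding is simple arithmetic, but it's worth keeping the formula visible when budgeting latency — a trivial sketch:

```python
def loop_overhead_s(extra_ms_per_call: float, tool_calls: int) -> float:
    """Extra wall-clock seconds when every tool call in a ReAct loop
    pays an added network round trip."""
    return extra_ms_per_call * tool_calls / 1000.0

# 100 ms of added per-call latency across a typical 8-12 call loop:
loop_overhead_s(100, 8)    # 0.8 s
loop_overhead_s(100, 12)   # 1.2 s
```

The overhead scales linearly with loop depth, which is why agents with long tool-call chains feel networked latency much more sharply than single-shot calls do.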
What This Looks Like in Practice
Here's how a customer support agent would actually decompose its tool layer across types:
graph LR
AGENT([Customer Support Agent])
AGENT --> LOCAL[Local Tools]
AGENT --> API[API-Based Tools]
AGENT --> MCP[MCP Servers]
AGENT --> STATEFUL[Stateful Tool]
LOCAL --> L1["format_ticket_summary(ticket)\n→ str"]
LOCAL --> L2["validate_email(email)\n→ bool"]
LOCAL --> L3["calculate_sla_deadline(priority, created_at)\n→ datetime"]
API --> A1["get_customer_profile(customer_id)\nCRM lookup · shared across agents"]
API --> A2["update_ticket_status(ticket_id, status)\nInternal ticketing API"]
MCP --> M1["mcp-server-knowledge-base\nSearch internal docs, FAQs, runbooks"]
MCP --> M2["mcp-server-incident-history\nQuery past incidents and resolutions"]
STATEFUL --> S1["BrowserSessionTool\nNavigate support portal · fill forms\nIsolated per session · explicit cleanup"]
style AGENT fill:#4A90E2,color:#fff,stroke:#2c6fad
style LOCAL fill:#6BCF7F,color:#fff,stroke:#4aad61
style L1 fill:#98D8C8,color:#333,stroke:#6ab8a8
style L2 fill:#98D8C8,color:#333,stroke:#6ab8a8
style L3 fill:#98D8C8,color:#333,stroke:#6ab8a8
style API fill:#6BCF7F,color:#fff,stroke:#4aad61
style A1 fill:#98D8C8,color:#333,stroke:#6ab8a8
style A2 fill:#98D8C8,color:#333,stroke:#6ab8a8
style MCP fill:#7B68EE,color:#fff,stroke:#5a4ecc
style M1 fill:#B8B0F0,color:#333,stroke:#8880cc
style M2 fill:#B8B0F0,color:#333,stroke:#8880cc
style STATEFUL fill:#FFA07A,color:#fff,stroke:#cc6040
style S1 fill:#FFCBA4,color:#333,stroke:#cc9060
The split is intentional, not arbitrary. Formatting and validation are pure functions — no reason to add a network hop. CRM lookup is shared across three other agents; centralizing it as an API means one place to update auth, rate limiting, and field mappings. The knowledge base and incident history are MCP servers because the platform team owns them independently, and they'll eventually be used by agents built in different frameworks. Browser automation is stateful because navigating the support portal is inherently multi-step — you can't do it with a single stateless call.
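The local tools in that diagram are plain pure functions — no I/O, no network hop. A sketch of one, where the SLA table and the function body are illustrative assumptions, not the article's actual implementation:

```python
from datetime import datetime, timedelta

# Hypothetical SLA table — illustration only.
SLA_HOURS = {"critical": 4, "high": 8, "normal": 24, "low": 72}

def calculate_sla_deadline(priority: str, created_at: str) -> str:
    """Pure function: deterministic, in-process — a textbook local tool."""
    hours = SLA_HOURS.get(priority.lower(), SLA_HOURS["normal"])
    deadline = datetime.fromisoformat(created_at) + timedelta(hours=hours)
    return deadline.isoformat()

calculate_sla_deadline("high", "2025-01-15T09:00:00")  # → "2025-01-15T17:00:00"
```

Because there's no state and no external dependency, this tool needs none of the retry, fallback, or isolation machinery the networked types require — which is exactly why it should stay local.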
Each tool type earns its place by solving a specific problem. When you don't have that specific problem, use the simpler type.
Conclusion
Tool use is the thing that makes agents actually useful. But the ecosystem has grown fast, and most teams are building on assumptions they've never examined.
Here's where things actually stand:
Local tools are where you start. They're fast, simple, and appropriate for anything that doesn't need to scale beyond a single process. Most agents should have at least some local tools for computation, formatting, and in-process logic.
API-based tools are what production systems actually run on. You want fault isolation, independent deployability, and shared implementation. The operational overhead is real but unavoidable at scale.
Plugin tools are useful for standard integrations you don't want to maintain. Use them for commoditized capabilities — search, calendar, email — not for anything business-critical that needs your specific behavior.
MCP is the right architectural bet for anything that needs to survive framework churn. Build your tool servers MCP-compatible now. The protocol is still maturing, but the direction is clear: this is how tools will be distributed and consumed going forward.
Stateful tools are powerful and dangerous. Use them when you genuinely need persistent state across tool calls. Isolate instances per agent run. Log every state transition. Treat them with the same security posture you'd apply to any component with persistent elevated access.
Automated tool development is still early-stage. It works for specific patterns — code interpretation, OpenAPI-to-tool generation, schema-driven extraction. Don't build production systems that depend on LLM-generated tools without a human review gate somewhere in the loop.
On configuration: default to auto for tool_choice. Use required deliberately and with edge-case testing. And implement the full error handling stack — schema validation, retry with backoff, graceful degradation, and logging — before you call anything production-ready.
Which Tool Type Do You Actually Need?
flowchart TD
START([Start here]) --> Q1{Simple computation,\nformatting, or\nin-process logic?}
Q1 -->|Yes| LOCAL[Local Tool]
Q1 -->|No| Q2{Shared capability\nused by multiple\nagents?}
Q2 -->|Yes| API[API-Based Tool]
Q2 -->|No| Q3{Third-party service\nyou don't want to\nmaintain wrappers for?}
Q3 -->|Yes| PLUGIN[Plugin Tool\nOpenAPI generation\nif they have a spec]
Q3 -->|No| Q4{Tools need to work\nacross frameworks,\nteams, or providers?}
Q4 -->|Yes| MCP[MCP Server]
Q4 -->|No| Q5{Multi-step process\nrequiring persistent\nsession state?}
Q5 -->|Yes| STATEFUL[Stateful Tool\nisolation + explicit\nlifecycle management]
Q5 -->|No| LOCAL2[Local Tool\nstart simple]
style START fill:#4A90E2,color:#fff,stroke:#2c6fad
style LOCAL fill:#6BCF7F,color:#fff,stroke:#4aad61
style LOCAL2 fill:#6BCF7F,color:#fff,stroke:#4aad61
style API fill:#6BCF7F,color:#fff,stroke:#4aad61
style PLUGIN fill:#6BCF7F,color:#fff,stroke:#4aad61
style MCP fill:#7B68EE,color:#fff,stroke:#5a4ecc
style STATEFUL fill:#FFA07A,color:#fff,stroke:#cc6040
style Q1 fill:#FFD93D,color:#333,stroke:#ccaa00
style Q2 fill:#FFD93D,color:#333,stroke:#ccaa00
style Q3 fill:#FFD93D,color:#333,stroke:#ccaa00
style Q4 fill:#FFD93D,color:#333,stroke:#ccaa00
style Q5 fill:#FFD93D,color:#333,stroke:#ccaa00
Start at the top. Go down only when you have a concrete reason to. Every step down the list adds complexity — make sure it's complexity you actually need.
The agents that survive production aren't the ones with the most tools. They're the ones that know when to use each tool, handle failures without collapsing, and make every tool call observable. Build toward that from the start.
For a deeper look at how tool execution fits into the broader security architecture of agentic systems, see Zero Trust Agents and The Agent DMZ.
References
LangChain & LangGraph
- LangChain Tools documentation — https://python.langchain.com/docs/concepts/tools/
- LangChain `bind_tools()` reference — https://python.langchain.com/docs/how_to/tool_calling/
- LangChain OpenAPI toolkit — https://python.langchain.com/docs/integrations/toolkits/openapi/
- LangGraph prebuilt ReAct agent — https://langchain-ai.github.io/langgraph/reference/prebuilt/
- `langchain-mcp-adapters` repository — https://github.com/langchain-ai/langchain-mcp-adapters
Model Context Protocol (MCP)
- MCP specification — https://modelcontextprotocol.io/specification
- MCP Python SDK — https://github.com/modelcontextprotocol/python-sdk
- MCP server: filesystem — https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem
- MCP server: postgres — https://github.com/modelcontextprotocol/servers/tree/main/src/postgres
Provider Tool Use APIs
- Anthropic tool use documentation — https://docs.anthropic.com/en/docs/build-with-claude/tool-use
- OpenAI function calling documentation — https://platform.openai.com/docs/guides/function-calling
- Google Gemini function calling — https://ai.google.dev/gemini-api/docs/function-calling
OpenAPI Specification
- OpenAPI 3.x specification — https://spec.openapis.org/oas/v3.1.0
Libraries
- Pydantic v2 documentation — https://docs.pydantic.dev/latest/
- Tenacity retry library — https://tenacity.readthedocs.io/en/latest/
- FAISS (Facebook AI Similarity Search) — https://github.com/facebookresearch/faiss
- httpx — https://www.python-httpx.org/
- psycopg2 — https://www.psycopg.org/docs/
- structlog — https://www.structlog.org/en/stable/
ReAct Reasoning Pattern
- Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629
RLHF / Instruction Tuning
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
Related Articles
- Agentic AI Observability: Why Traditional Monitoring Breaks with Autonomous Systems
- Beyond Copy-Paste: Staying Relevant in the Age of AI Code Assistants