
Tool Use in Agentic AI: A 2025 Systems Overview


The past year has seen a strategic shift in AI systems: instead of solely scaling up model parameters, the industry has pivoted toward tool-augmented AI systems that extend capabilities through external functions and APIs. This shift represents more than an incremental improvement—it’s a paradigm change that’s reshaping how we think about intelligent systems. Improvements in an agent’s performance now stem largely from how well it uses tools – modern models extend their “horizons” by calling external APIs or knowledge bases when needed.

The Tool Schema Foundation

In the context of agentic AI, a tool is typically a function or API that the agent can invoke to extend its capabilities beyond language modeling. Each tool is defined by a name, a description of its purpose, a set of parameters it accepts, and an output or return format. This metadata acts as a contract between the agent and the tool. For instance, using OpenAI’s function calling format, a robust tool in agentic AI requires three critical components:

  1. Clear Metadata Contract: Precise descriptions that help the AI understand when and how to use the tool
  2. Strict Parameter Validation: JSON Schema or Pydantic models to ensure well-formed inputs
  3. Self-Documenting Interface: Comprehensive documentation of side effects and return formats
{
  "type": "function",
  "name": "send_email",
  "description": "Send an email to a specified recipient with a subject and message.",
  "parameters": {
    "type": "object",
    "properties": {
      "to": {"type": "string", "description": "Recipient email address"},
      "subject": {"type": "string", "description": "Email subject line"},
      "body": {"type": "string", "description": "Email message body"}
    },
    "required": ["to", "subject", "body"],
    "additionalProperties": false
  }
}

In practice, the name and description inform the AI when to use this tool, and the parameters specify how to call it.
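
To make the validation layer concrete, here is a minimal Pydantic sketch that mirrors the schema above. The SendEmailArgs model and send_email function are illustrative stand-ins for your own tool implementation, not part of any particular SDK:

from pydantic import BaseModel, ValidationError

class SendEmailArgs(BaseModel):
    """Mirrors the JSON Schema above: all three fields are required."""
    model_config = {"extra": "forbid"}   # equivalent of "additionalProperties": false

    to: str        # recipient email address (could be tightened with a pattern)
    subject: str   # email subject line
    body: str      # email message body

def send_email(args: SendEmailArgs) -> dict:
    # Hypothetical backend call; return a structured result the agent can observe.
    return {"status": "sent", "to": args.to}

# A malformed tool call from the model is rejected before it reaches the backend:
raw_args = {"to": "bob@example.com", "subject": "Refund"}   # "body" is missing
try:
    send_email(SendEmailArgs(**raw_args))
except ValidationError as err:
    print(err)   # this error can be fed back to the agent as an observation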

Best Practice: Tool metadata is so crucial that even slight ambiguities can cause the AI to misuse a tool or abandon it entirely. Always validate tools with sample prompts to ensure discoverability and usability.

The TAO Loop: Think → Act → Observe

At the heart of every effective agent lies the Think-Act-Observe (TAO) loop—an iterative reasoning cycle that enables complex problem-solving. This loop is an implementation of the ReAct paradigm (Reason + Act): the agent thinks about what to do next, acts by invoking a tool or giving an answer, then observes the result, which informs the next cycle. Modern agent SDKs implement this loop under the hood, allowing the agent to autonomously step through a task until completion. Conceptually, each iteration works as follows:

  1. Think: The agent reflects on the current goal and formulates the next step
  2. Act: Based on that plan, the agent either calls a tool or provides a final answer
  3. Observe: The agent processes the result and uses it to inform the next iteration

In code, an agent loop typically looks like:

context = [user_request]
while True:
    thought = model.predict(context)          # Think: LLM proposes the next step
    if thought.contains_tool_call():
        tool_name, tool_args = thought.parse_tool()
        result = execute_tool(tool_name, tool_args)  # Act: execute the tool
        context.append(observation(result))   # Observe: feed the result back into context
        continue
    final_answer = thought                    # No tool call: the task is complete
    break

This pattern, implemented by frameworks like OpenAI’s Agents SDK and LangGraph, transforms one-shot responses into adaptive problem-solving sessions. The TAO loop enables:

  • Error Correction: If a tool response is unexpected, the agent can adjust its approach
  • Dynamic Planning: Agents can modify their strategy based on intermediate results
  • Complex Task Decomposition: Multi-step problems become manageable through iterative execution

High-level frameworks hide this complexity. For instance, the Runner in OpenAI’s Agents SDK will repeatedly call the LLM and automatically handle tool execution and looping until the LLM returns a final_output. Likewise, LangGraph’s utilities can construct a ReAct loop as a graph of nodes (one node for LLM reasoning, another for tool execution), cycling until a solution is reached. The key idea is that agents iteratively reason and take actions rather than producing a one-shot answer: if a tool’s response is unexpected or a step fails, the agent can adjust its plan in the next Think step. The TAO loop thus gives the system a trial-and-error adaptability resembling human problem-solving, which is crucial for tackling complex, long-horizon tasks.

Scalable Architectural Patterns

As AI agents evolve to handle increasingly complex tasks, their reliance on external tools grows in tandem. From API endpoints to calculators, search engines, or other specialized agents, modern AI systems need scalable, intelligent mechanisms for selecting, invoking, and learning from tool interactions. The challenge isn’t just about having access to tools—it’s about intelligently orchestrating them at scale. An agent might have access to hundreds or thousands of potential tools, each with different interfaces, capabilities, and use cases. Without proper architectural patterns, this complexity can quickly become unmanageable, leading to poor performance, unreliable outputs, and maintenance nightmares. Below are eight proven architectural patterns that enable agents to manage large tool ecosystems while maintaining performance, interpretability, and adaptability.

1. Retriever-Augmented Selection

This pattern adapts Retrieval-Augmented Generation (RAG) to tool selection, allowing agents to dynamically retrieve the most relevant tools based on task context. Rather than overwhelming the agent with every available tool, this approach provides intelligent filtering based on the current task requirements.

  • User queries or instructions are embedded into a vector space
  • The agent retrieves top-k tools from a vector store populated with tool metadata, descriptions, or past usage logs
  • Supporting documentation and usage examples are also retrieved to inform tool invocation
  • This enables agents to work with thousands of tools without context overload

The key advantage is scalability—as the tool ecosystem grows, the agent doesn’t suffer from choice paralysis or context window limitations (Examples: Toolshed, Tulip Agent).
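
A minimal sketch of this idea, using a toy bag-of-words “embedding” in place of a real embedding model and vector store; the tool registry and retrieve_tools helper are hypothetical:

from collections import Counter
import math

# Toy stand-in for an embedding model: bag-of-words vectors. A real system would
# embed tool descriptions with a model and store them in a vector database.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical tool registry: name -> description (the metadata contract from earlier).
TOOLS = {
    "send_email": "Send an email to a recipient with a subject and message body",
    "search_web": "Search the web and return the top results for a query",
    "query_database": "Run a read-only SQL query against the analytics database",
}
TOOL_VECTORS = {name: embed(desc) for name, desc in TOOLS.items()}

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    """Return the top-k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(TOOLS, key=lambda name: cosine(q, TOOL_VECTORS[name]), reverse=True)
    return ranked[:k]

# Only the retrieved subset is exposed to the agent, keeping the prompt small.
print(retrieve_tools("email the customer about their refund"))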

2. Toolformer-Style Dynamic Routing

This architecture teaches the model when and how to use tools by integrating tool calls during training or fine-tuning. Instead of relying on complex prompt engineering, the model learns to naturally incorporate tool usage as part of its generation process.

  • The model predicts tool/API usage inline as part of token generation
  • Tool outputs are fetched and injected back into the generation stream
  • This reduces prompt engineering overhead and enables seamless tool use at inference time

This pattern creates more natural, fluent tool integration that feels less mechanical than traditional prompt-based approaches (Example: Meta’s Toolformer).

3. Agents-as-Tools: Modular Agent Composition

In this pattern, agents themselves are treated as callable tools, enabling higher-level orchestration and modularity. This creates a hierarchical system where specialized agents can be composed to solve complex problems.

  • A parent agent can invoke a specialized child agent (e.g., summarizer, calculator)
  • Each child agent runs its own reasoning loop, internal memory, or context window
  • This encourages reusability, encapsulation, and easier auditing

The modularity allows for better separation of concerns and makes it easier to debug, test, and improve individual components of the system (Examples: LangGraph, AutoGen, MetaGPT).
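
A stripped-down sketch of the composition idea; the Agent class and child agents below are hypothetical placeholders, and each child would normally wrap its own LLM reasoning loop:

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[str], str]                 # the agent's own reasoning loop
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def call_tool(self, tool_name: str, task: str) -> str:
        return self.tools[tool_name](task)

# Specialized child agents, exposed to the parent simply as callables.
summarizer = Agent("summarizer", run=lambda text: text[:60] + "...")
calculator = Agent("calculator", run=lambda expr: str(eval(expr, {"__builtins__": {}})))

# The parent agent treats each child agent's run() as just another tool.
parent = Agent(
    "orchestrator",
    run=lambda task: task,
    tools={"summarize": summarizer.run, "calculate": calculator.run},
)

print(parent.call_tool("calculate", "2 * (3 + 4)"))
print(parent.call_tool("summarize", "Tool-augmented agents extend their capabilities by calling external functions."))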

4. ReAct + Scratchpad Execution Loop

Combining reasoning and acting (ReAct), this loop allows agents to interleave tool usage with chain-of-thought reasoning. This pattern provides transparency into the agent’s decision-making process while maintaining flexibility in execution.

  • The agent writes “Thought → Action → Observation” steps
  • Tool invocations are inserted in the loop as actions
  • Observations from tools are saved in a scratchpad to guide future steps

This pattern excels at complex, multi-step problems where the agent needs to reason about intermediate results before proceeding.
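
A minimal scratchpad sketch, assuming a hypothetical run_tool() placeholder for the tool-execution layer rather than any specific framework:

# The agent appends Thought/Action/Observation entries and the growing
# transcript is re-sent to the LLM on each turn.
scratchpad: list[str] = []

def run_tool(action: str, action_input: str) -> str:
    return f"(result of {action} for '{action_input}')"   # stubbed tool result

def react_step(thought: str, action: str, action_input: str) -> str:
    """Record one Thought -> Action -> Observation cycle and return the transcript."""
    observation = run_tool(action, action_input)
    scratchpad.extend([
        f"Thought: {thought}",
        f"Action: {action}[{action_input}]",
        f"Observation: {observation}",
    ])
    return "\n".join(scratchpad)   # this transcript is appended to the next prompt

print(react_step("I need the latest revenue figure.", "query_database",
                 "SELECT revenue FROM quarterly_report"))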

5. Planner-Executor Split: Task Decomposition and Delegation

A classic architectural separation where planning and execution responsibilities are split across sub-agents or modules. This division of labor allows for more sophisticated task handling and better error recovery.

  • The planner breaks down tasks and decides which tools are needed
  • The executor invokes those tools and returns results
  • Useful for multi-step workflows, long-horizon reasoning, and robustness

This separation allows the planner to focus on high-level strategy while the executor handles the tactical implementation details (Examples: AutoGPT, BabyAGI, CrewAI).
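
A minimal planner-executor sketch; plan() stands in for an LLM call that decomposes the task, and the tool implementations are hypothetical:

# The planner produces a structured list of steps; the executor dispatches them.
def plan(task: str) -> list[dict]:
    return [
        {"tool": "search_web", "args": {"query": task}},
        {"tool": "summarize", "args": {"text": "<search results>"}},
    ]

# Hypothetical tool implementations the executor can dispatch to.
TOOL_IMPLS = {
    "search_web": lambda query: f"results for '{query}'",
    "summarize": lambda text: f"summary of {text}",
}

def execute(steps: list[dict]) -> list[str]:
    results = []
    for step in steps:
        results.append(TOOL_IMPLS[step["tool"]](**step["args"]))
    return results

print(execute(plan("latest developments in agentic AI")))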

6. Tool Memory Modules: Experience-Aware Tool Use

A persistent memory of tool usage improves decision-making by referencing past successful interactions. This pattern enables agents to learn from experience and avoid repeating mistakes.

  • Stores inputs, outputs, and contexts for prior tool calls
  • Retrieved to inform tool selection and argument formatting
  • Helps agents avoid repeating failed calls or misusing APIs

Memory modules can significantly improve agent performance over time by building up a knowledge base of successful tool interaction patterns (Example: ToolLLM).
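
A minimal in-memory sketch of such a memory module; a production system would persist these records to a database or vector store rather than a Python list:

import json
from datetime import datetime, timezone

# Every tool call is recorded and can be retrieved later to inform tool
# selection and argument formatting.
TOOL_MEMORY: list[dict] = []

def record_call(tool: str, args: dict, output, success: bool) -> None:
    TOOL_MEMORY.append({
        "tool": tool,
        "args": args,
        "output": output,
        "success": success,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

def recall(tool: str, only_successes: bool = True) -> list[dict]:
    """Return prior calls to a tool, e.g. to reuse argument patterns that worked."""
    return [m for m in TOOL_MEMORY
            if m["tool"] == tool and (m["success"] or not only_successes)]

record_call("send_email", {"to": "bob@example.com", "subject": "Refund"}, "sent", True)
record_call("send_email", {"to": "bob"}, "invalid address", False)
print(json.dumps(recall("send_email"), indent=2))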

7. Verified Tool Use via Feedback Channels

To increase reliability, some systems incorporate feedback from tool outputs. This pattern adds a verification layer that helps ensure tool calls are successful and meaningful.

  • Tools return structured responses including execution status or confidence
  • Agents use this feedback to retry, revise, or re-plan
  • Adds interpretability and safety, especially in high-stakes domains

This pattern is particularly valuable in production environments where reliability and error handling are critical.
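
A small sketch of the retry-on-structured-feedback idea; flaky_search is a hypothetical tool that fails intermittently purely for illustration:

import random

# Every tool returns a structured envelope with a status, and the caller
# retries or re-plans based on that feedback.
def flaky_search(query: str) -> dict:
    if random.random() < 0.5:
        return {"status": "error", "detail": "upstream timeout", "data": None}
    return {"status": "ok", "detail": None, "data": f"results for '{query}'"}

def call_with_retries(tool, args: dict, max_attempts: int = 3) -> dict:
    for attempt in range(1, max_attempts + 1):
        response = tool(**args)
        if response["status"] == "ok":
            return response
        # The structured error is also a useful observation for the agent's next Think step.
        print(f"attempt {attempt} failed: {response['detail']}")
    return {"status": "error", "detail": "max attempts exceeded", "data": None}

print(call_with_retries(flaky_search, {"query": "agentic AI tool use"}))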

8. Self-Created Tools: Agents That Build Their Own Tools

Advanced architectures support agents that generate new tools on the fly based on the needs of the task. This represents the cutting edge of adaptive AI systems and is discussed in detail in the next section.

Autonomous Tool Creation - Agents That Create Tools

A frontier that has quickly moved from speculation to reality is autonomous tool creation – agents that can build new tools on the fly to solve a problem. Traditionally, an AI agent could only use tools that human developers had predefined for it, which is a bottleneck in domains needing highly specialized functions. The ToolMaker project exemplifies this new capability. It introduces an agent framework where, given a short description of a task and a link to a relevant code repository (e.g., a GitHub repo accompanying a research paper), the agent can install dependencies, read the code, and synthesize a new tool (a callable function) to perform that task. Crucially, ToolMaker doesn’t stop at generating code; it uses a closed-loop self-correction mechanism to iteratively run the code, catch errors, and refine the implementation. This addresses the common situation where an LLM’s first attempt at writing code might not work perfectly. By debugging itself (leveraging unit tests or error messages as feedback), the agent gradually arrives at a correct and robust tool.
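
The sketch below illustrates the closed-loop idea in miniature; it is not ToolMaker’s actual implementation. generate_code() and revise_code() stand in for LLM calls, and a unit test drives the refinement:

# Closed-loop "generate, test, refine": the unit test result feeds back into the
# next code revision until the tool passes or a retry budget is exhausted.
def generate_code() -> str:
    return "def mean(xs): return sum(xs) / len(x)"      # first attempt contains a bug

def revise_code(code: str, error: str) -> str:
    return "def mean(xs): return sum(xs) / len(xs)"     # corrected after error feedback

def passes_tests(code: str) -> tuple[bool, str]:
    namespace: dict = {}
    try:
        exec(code, namespace)                           # sandbox this in production!
        assert namespace["mean"]([1, 2, 3]) == 2
        return True, ""
    except Exception as err:
        return False, repr(err)

code = generate_code()
for _ in range(3):                                      # bounded refinement loop
    ok, error = passes_tests(code)
    if ok:
        break
    code = revise_code(code, error)
print("final tool:", code)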

The authors evaluated this with TM-BENCH, a benchmark of 15 complex tasks across domains like medical analysis and finance, each verified by 100+ unit tests. ToolMaker achieved roughly 80% task success versus only ~20% for prior agents relying on human-written tools. This four-fold improvement represents a major step toward AI systems that “learn to fish” (write their own programs) rather than just use predefined tools.

Despite ToolMaker’s impressive results, two key challenges remain: code quality dependency (agents struggle with outdated or poorly documented repositories) and security risks (executing arbitrary code requires careful sandboxing). However, the potential is transformative - future agents could automatically discover, combine, and create tools in real-time when facing new problems. This would be particularly valuable in scientific research, eliminating the need to wait for human developers to build custom analysis pipelines. Ultimately, autonomous tool creation frees AI agents from pre-built limitations, enabling nearly unlimited problem-solving capabilities.

Tool Lifecycle: Design, Observability, and Governance

Introducing a tool into an AI agent’s arsenal isn’t a one-and-done event – it involves a lifecycle from inception to eventual deprecation. It starts with design & development: identifying the need, specifying the API (as discussed, with a clear schema), and implementing the backend logic or integration. Once deployed, a tool enters the usage and maintenance phase. Here, observability is paramount.

Developers should instrument agents to log every tool call (timestamp, inputs, outputs, latency, errors) so each step of the workflow can be inspected. This kind of telemetry helps answer questions like: Which tools are used most? How often do they fail or return nothing useful? Did the agent’s use of the tool produce the intended effect? By monitoring these, we can detect when a tool’s performance degrades or when the agent is misusing it. Observability data might show, for instance, that the send_email function failed 5% of the time due to invalid addresses, prompting improvements in validation or agent instruction.
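
A minimal instrumentation sketch using a logging decorator; the send_email tool and the log fields are illustrative and not tied to any particular framework:

import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_telemetry")

# Records inputs, outputs, latency, and errors for every tool call so each step
# of the workflow can be inspected after the fact.
def instrument(tool_fn):
    @functools.wraps(tool_fn)
    def wrapper(**kwargs):
        start = time.perf_counter()
        try:
            result = tool_fn(**kwargs)
            status = "ok"
            return result
        except Exception as err:
            status, result = "error", repr(err)
            raise
        finally:
            log.info(json.dumps({
                "tool": tool_fn.__name__,
                "inputs": kwargs,
                "output": str(result)[:200],
                "status": status,
                "latency_ms": round((time.perf_counter() - start) * 1000, 1),
            }))
    return wrapper

@instrument
def send_email(to: str, subject: str, body: str) -> str:
    return "sent"                        # hypothetical backend call

send_email(to="bob@example.com", subject="Refund", body="Processed today.")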

Another best practice is designing tools to be side-effect conscious. Agents, especially those using the TAO loop, might call a tool multiple times (even erroneously). If a tool action has irreversible side effects (e.g. transferring money or sending an email), consider adding safeguards: perhaps the tool checks if the action was already taken (to avoid duplicate emails), or requires confirmation from a human-in-the-loop for destructive operations.
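
One lightweight safeguard is an idempotency key, sketched below with an in-memory store and a hypothetical send_email_once wrapper; a real system would keep the keys in a database with an expiry:

import hashlib

# Identical send_email calls are de-duplicated so a retrying agent cannot send
# the same message twice.
_COMPLETED: set[str] = set()

def idempotency_key(tool: str, args: dict) -> str:
    canonical = tool + "|" + "|".join(f"{k}={args[k]}" for k in sorted(args))
    return hashlib.sha256(canonical.encode()).hexdigest()

def send_email_once(to: str, subject: str, body: str) -> str:
    key = idempotency_key("send_email", {"to": to, "subject": subject, "body": body})
    if key in _COMPLETED:
        return "skipped: duplicate of an email already sent"
    _COMPLETED.add(key)
    return "sent"                        # hypothetical backend call

print(send_email_once("bob@example.com", "Refund", "Processed today."))
print(send_email_once("bob@example.com", "Refund", "Processed today."))   # deduplicated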

Over time, tools may need updates or deprecation. An API might change, or a better tool becomes available. Because agents are sensitive to the tool schema and behavior, any change requires retraining or re-prompting the agent about the new usage. One governance approach is versioning: introducing a new tool version (with a new name or version field) and gradually phasing out the old one. During this transition, observability again helps – you can watch if/when the agent switches to the new tool. If the agent hesitates to use the new version, it might indicate the description or examples need refining.

Governance also means controlling how and when an agent can use tools. In a production setting, not every tool should be available to every agent or query. It’s wise to enforce permissioning and auditing. For instance, an enterprise agent might have a tool to access customer data, but that tool should only be enabled for queries coming from authorized users or containing certain triggers. Some frameworks implement capability-based security, where the agent is issued a capability token that encodes which tools (or which data resources) it can access. The tool’s backend will validate this token on each call, preventing misuse even if the agent prompt is compromised. Moreover, all tool usage can be logged to an audit trail (who/what prompted the action, and what was done). Such governance measures ensure that as agents become more autonomous, they remain compliant with safety, privacy, and business rules.
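
A minimal sketch of capability-token issuance and verification using HMAC signing; the token format, tool names, and helper functions are illustrative rather than any specific product’s API:

import base64
import hashlib
import hmac
import json

SECRET = b"server-side-signing-key"      # held by the platform, never exposed to the agent

# The platform signs the set of tools an agent may call; every tool backend
# verifies the token before executing.
def issue_token(agent_id: str, allowed_tools: list[str]) -> str:
    payload = json.dumps({"agent": agent_id, "tools": allowed_tools}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.b64encode(payload).decode() + "." + sig

def check_capability(token: str, tool: str) -> bool:
    encoded, sig = token.rsplit(".", 1)
    payload = base64.b64decode(encoded)
    if not hmac.compare_digest(sig, hmac.new(SECRET, payload, hashlib.sha256).hexdigest()):
        return False                     # tampered or forged token
    return tool in json.loads(payload)["tools"]

token = issue_token("support-agent", ["send_email", "search_kb"])
print(check_capability(token, "send_email"))      # True
print(check_capability(token, "delete_account"))  # False: out of scope, call refused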

Measuring Effectiveness - Tool Use Perspective

As agentic systems become more complex, evaluating their performance and reliability requires going beyond traditional NLP metrics. It’s no longer enough to look at an answer’s accuracy – we need to assess how the agent arrived there and what it did along the way. Here, the focus is on how effectively the agent uses its tools, which breaks down into several sub-metrics (a small scoring sketch follows the list):

  1. Tool Selection Accuracy: Does the agent choose an appropriate tool for the job when one is available? Some evaluation frameworks log each decision point where a tool could have been used and check if the agent’s choice matches an expectation.
  2. Parameter Accuracy: When the agent calls a tool, are the arguments correct? For instance, if the user asked to email Bob about a refund, did the agent call send_email with Bob’s email address and a message about the refund? Mis-formulated parameters can lead to failure even if the right tool was picked.
  3. Tool Efficacy: This measures the outcome of tool calls. One can track the success rate of each tool (e.g., how often a database query returns results or how often a web search yields an answer the agent uses). If a tool frequently returns nothing useful and the agent loops trying it again and again, that’s a problem. A low efficacy might indicate either the tool is not implemented well or the agent is using it incorrectly.
  4. Tool Call Frequency and Cost: It’s useful to measure how many tool calls an agent makes to solve a query and the cost associated (in latency or API usage). An “efficient” agent might solve a task with 2 tool calls, whereas a less efficient one takes 10 calls (e.g., making many unnecessary web searches). Too many calls could indicate the agent is thrashing or stuck in a loop. Some benchmarks or evaluations impose a cap on calls and measure success under that constraint.
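
A minimal sketch of computing these metrics from a per-task call log; the log fields are hypothetical and would in practice come from the telemetry discussed in the lifecycle section:

# Each entry pairs the tool the agent chose with the expected tool (from an
# evaluation set), plus argument correctness, outcome, and latency.
calls = [
    {"tool": "search_web", "expected_tool": "search_web",     "args_ok": True,  "succeeded": True,  "latency_ms": 420},
    {"tool": "search_web", "expected_tool": "query_database", "args_ok": True,  "succeeded": False, "latency_ms": 380},
    {"tool": "send_email", "expected_tool": "send_email",     "args_ok": False, "succeeded": False, "latency_ms": 95},
]

def rate(flags) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

metrics = {
    "tool_selection_accuracy": rate(c["tool"] == c["expected_tool"] for c in calls),
    "parameter_accuracy": rate(c["args_ok"] for c in calls),
    "tool_efficacy": rate(c["succeeded"] for c in calls),
    "calls_per_task": len(calls),
    "total_latency_ms": sum(c["latency_ms"] for c in calls),
}
print(metrics)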

Deploying Agents - Building Trustworthy Systems

Deploying agents in real-world applications raises important security and isolation questions. By design, a tool-using agent is executing actions (queries, code, transactions) on behalf of a user, which means there’s potential for things to go wrong – either accidentally or via malicious prompting. Several patterns and best practices help mitigate these risks:

  1. Capability-Based Security: One way to control an agent’s powers is through capability tokens. Instead of giving the agent unlimited access to every tool or data source, the system issues a scoped token that the agent must present when calling a tool. If the agent tries to call with no token or with insufficient rights, the call is refused. This ensures that even if the agent’s prompt is hijacked or it “decides” to do something out of scope, the action is blocked at the policy level.

  2. Tenant Isolation: In multi-tenant applications (common in SaaS), we must ensure an AI agent serving Tenant A never leaks or acts on data from Tenant B. Large language models themselves are ill-suited to enforce this, as they can be tricked by prompt injections into outputting whatever they know. Therefore, isolation should be handled outside the model. One approach is the silo model: run a separate instance of the agent (and its tool backend) for each tenant. This could mean each tenant has its own vector database, its own API keys, maybe even its own fine-tuned model. This provides strong isolation at the cost of more resources. Another approach is context-based isolation: use shared infrastructure but always attach a “tenant context” to queries and filter data by that context. For example, every tool call implicitly includes tenant_id, and the tool implementation checks it (much like multi-tenant database rows have a tenant ID column). In practice, many systems use a hybrid: coarse isolation at a high level (each tenant’s requests run in a logically separated workflow or container) plus fine-grained policy checks inside tools to double-check.

  3. Sandboxing and Execution Control: Tools that execute code (like a Python interpreter, shell command tool, etc.) should be sandboxed. For example, OpenAI’s Code Interpreter runs in a secure, firewalled environment with strict resource limits – so even if the agent tries something malicious or uses too much memory, it can’t affect the host system. When building your own execution tools, use containers or VMs with restricted permissions. Additionally, impose timeouts and step limits: an agent shouldn’t run an infinite loop script or connect to arbitrary external servers unless explicitly intended. Sandboxing also extends to things like web browsing tools – you might implement a proxy that the agent’s web search goes through, which strips away scripts or blocks certain websites. This prevents an agent from inadvertently fetching malware or inappropriate content during its actions.

  4. Audit and Observability: For any action an agent takes that has real-world effects (updating a record, sending an email, executing a transaction), it’s wise to keep an audit log associating the action to the initiating user, the agent, and ideally the chain-of-thought that led to it. Privacy considerations must be balanced here (since logs might contain user data), but from a security standpoint, traceability is a friend: it turns the agent from a black box into a glass box that can be inspected after the fact.

  5. Validation and Formal Safety Checks: An emerging idea is to apply formal methods or rule-based validators to agent actions before execution. For example, if an agent crafts a SQL query via a tool, one could run that query through a validator that checks it against a schema and policy (e.g., no DROP TABLE statements if not allowed, or limiting result row counts to prevent data exfiltration). A minimal version of such a check is sketched below.
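
The following sketch shows one way such a rule-based SQL check could look; the forbidden-statement list and row limit are arbitrary examples, not a complete policy:

import re

# Block write/DDL statements and force a row limit before an agent-generated
# query reaches the database.
FORBIDDEN = re.compile(r"\b(drop|delete|update|insert|alter|truncate|grant)\b", re.IGNORECASE)
MAX_ROWS = 1000

def validate_sql(query: str) -> str:
    if FORBIDDEN.search(query):
        raise ValueError("query rejected: statement type not allowed for this agent")
    if "limit" not in query.lower():
        query = query.rstrip("; \n") + f" LIMIT {MAX_ROWS};"
    return query

print(validate_sql("SELECT name, email FROM customers WHERE region = 'EU'"))
# validate_sql("DROP TABLE customers")  # would raise ValueError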

In summary, securing agentic AI is about defense in depth. By applying the same security-engineering rigor we apply to traditional software, we can enjoy the productivity of autonomous agents without unwelcome surprises. Security teams should be involved early when introducing AI agents and should treat tools as extension points that need the same security review as any other API endpoint in the product.

Future Frontiers: What’s Next?

Looking ahead, research and development in agentic AI is actively exploring several exciting (and challenging) frontiers:

  1. Self-Discovering Toolchains: We can expect agents to become more autonomous in not just creating tools, as ToolMaker does, but also in discovering and selecting tools on the fly. An agent might be able to read API documentation or a registry of third-party tools and incorporate a new API into its plan at runtime. Imagine an AI developer agent that, when tasked with something beyond its current toolbox, automatically searches an API marketplace or open-source repository, finds a suitable tool, and integrates it (with minimal human intervention).

  2. RL-Optimized Policies: A lot of current tool-use logic is guided by prompting and heuristic rules. An emerging approach is to apply reinforcement learning (RL) or other optimization techniques to learn the optimal policy for when and how to use tools. RL could also tune parameters of tool use (e.g., how many search results to retrieve) and refine the chain-of-thought to maximize long-term success. The combination of LLMs with learning algorithms is tricky (due to the cost of simulations and the complexity of language state), but it holds promise for fine-tuning agent behavior beyond what prompt engineering can achieve. In production, this might manifest as adaptive agents that improve their tool-use policy based on real user feedback – essentially an online reinforcement learning loop where success with users is the reward.

Tool use in agentic AI isn’t a temporary enhancement—it’s the fundamental architecture of intelligent systems. By aligning with how humans solve problems (leveraging tools, collaborating with specialists, building on past work), tool-augmented agents provide a scalable framework for real-world problem-solving.

The evidence is clear: in 2025’s AI landscape, success comes not from the largest models, but from the most thoughtfully designed tool ecosystems. The journey from model outputs to meaningful autonomous action runs directly through intelligent tool use.



This post is licensed under CC BY 4.0 by the author.