
When Tool Documentation Falls Short: Teaching Agents to Learn From Interaction

At Samaya AI, we build agents for financial workflows that harness a wide range of tools to produce precise, analyst-grade research. In the process, we found a recurring problem: capturing every behavioral nuance of a tool is difficult. This problem compounds as new tools are added and the guidance for combining them requires constant maintenance. Even small gaps in documentation can cause agents to misuse tools, waste tokens, or produce lower-quality answers. Yet today’s benchmarks largely fail to capture this failure mode.

Samaya's ML team, together with Skyler Hallinan, Prof. Xiang Ren, and Prof. Sai Praneeth Karimireddy from the University of Southern California, addresses this gap with ToolObserver, a method that guides the model to interact with the tools themselves and iteratively refine its understanding of tool behavior. We're also introducing OpaqueToolsBench, a benchmark designed to evaluate agents in settings where tool documentation is incomplete or imperfect.

On OpaqueToolsBench, ToolObserver outperforms state-of-the-art methods by 18.6% on average, while using 3.5–7.5× fewer tokens during inference.

Our work has been accepted and will be presented at the ACM Conference on AI and Agentic Systems (ACM CAIS 2026). The paper, code, and dataset are now publicly available [paper] [code & data].

The Tool Documentation Blind Spot

Consider a query a financial analyst might ask: “Why did Company X’s stock move after earnings, and is the reaction justified by the underlying fundamentals?”

Answering this well requires an agent to sift through filings, transcripts, market data, search results, and internal knowledge sources, each reached through a different tool. Each tool may appear straightforward in isolation, but in practice each carries assumptions and edge cases that can materially change the answer.

  • Overlapping tools: The agent may fetch “revenue” from both a filings retriever and a market data API, but the former may return reported GAAP revenue while the latter returns a standardized or restated metric.
  • Incomplete retrieval: The agent may rely on a search tool which silently omits relevant passages, either because its index doesn't cover certain sources, or because a relevance filter dropped them.
  • Outdated documentation: The agent may rely on an internal knowledge source for prior research, but that source may no longer be supported even though the documentation still suggests it is.
  • Missing behavioral nuance: The agent may call a market data endpoint that appears to provide live prices, but in practice returns data with a one-hour delay.

When these behaviors are missing from the documentation, the agent can appear to follow the right research process while making the wrong analytical judgment. It may compare incompatible revenue figures, miss the passage that explains margin pressure, treat an absent result as absent evidence, or explain a stock move using stale market data.
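To make the last of these cases concrete, here is a minimal sketch of how an observed lag could be turned into a documentation note. Everything in it is hypothetical: the quote payload, its `as_of` field, and the `staleness_note` helper are illustrative assumptions, not part of any real market data API or of ToolObserver itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def observed_staleness(quote: dict, now: datetime) -> timedelta:
    """Age of a quote, computed from the timestamp the tool itself returned."""
    as_of = datetime.fromisoformat(quote["as_of"])  # assumed ISO 8601 field
    return now - as_of


def staleness_note(
    quote: dict,
    now: datetime,
    tolerance: timedelta = timedelta(minutes=1),
) -> Optional[str]:
    """Return a note for the tool description if the data is older than claimed."""
    lag = observed_staleness(quote, now)
    if lag <= tolerance:
        return None  # behaves as documented, nothing to record
    minutes = int(lag.total_seconds() // 60)
    return f"Observed: quotes lag real time by about {minutes} minutes."


# Hypothetical response from a tool documented as returning "live" prices.
quote = {"ticker": "X", "price": 101.25, "as_of": "2025-01-15T14:30:00+00:00"}
now = datetime(2025, 1, 15, 15, 30, tzinfo=timezone.utc)
note = staleness_note(quote, now)
print(note)
```

A note like this, attached to the tool description, is the kind of behavioral detail that would otherwise stay invisible to the agent.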

This is not unique to finance. In almost all real-world scenarios, agents depend on tools that behave in ways no specification can fully capture. The gap may come from hidden assumptions, edge cases, changing systems, or interactions between tools. When agents cannot learn these nuances, they may execute the right steps and still reach the wrong answer.

Figure 1: Even when an agent follows the right workflow, hidden tool mismatches can compound into the wrong analytical answer.

ToolObserver: Improving Documentation via Interaction

To address these issues, we propose ToolObserver, an approach that starts from a simple premise: when tool documentation is incomplete, agents should learn from the tool itself.

ToolObserver learns from the agent’s actual task trajectories. It observes the sequence of reasoning steps, tool calls, outputs, errors, and final outcomes, then uses an editor model to refine the tool descriptions. The loop is straightforward:

  • Explore: Use the current documentation to solve realistic tasks and collect tool-use trajectories.
  • Reflect: Inspect the agent’s successes, failures, and tool misunderstandings, then propose updates to the tool documentation.
  • Update: The refined documentation replaces the previous version, correcting inaccurate claims, surfacing prerequisites, and recording observed failure modes.
  • Iterate: The agent runs again with the updated documentation. This is repeated until no further documentation updates occur.

Figure 2: ToolObserver learns from successful and failed tool-use trajectories, then updates the docs to reflect how those tools work together in practice.
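The loop can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the released implementation: `run_agent` and `propose_edits` stand in for the agent and the editor model, and the `Trajectory` fields, the `max_rounds` cap, and the toy stubs below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Docs = Dict[str, str]  # tool name -> current description


@dataclass
class Trajectory:
    task: str
    tool_calls: List[str]  # textual record of calls and their outputs
    succeeded: bool


def refine_docs(
    docs: Docs,
    tasks: List[str],
    run_agent: Callable[[str, Docs], Trajectory],
    propose_edits: Callable[[Docs, List[Trajectory]], Docs],
    max_rounds: int = 5,  # assumed safety cap, not from the paper
) -> Docs:
    """Explore, reflect, update, and iterate until the docs stop changing."""
    for _ in range(max_rounds):
        # Explore: solve the tasks with the current documentation.
        trajectories = [run_agent(task, docs) for task in tasks]
        # Reflect: the editor proposes replacement descriptions.
        edits = propose_edits(docs, trajectories)
        # Update: merge the edits into a new documentation version.
        updated = {**docs, **edits}
        # Iterate: stop once a round produces no changes.
        if updated == docs:
            break
        docs = updated
    return docs


# Toy stand-ins for the agent and the editor model.
def toy_agent(task: str, docs: Docs) -> Trajectory:
    knows_delay = "delay" in docs["get_price"]
    call = "get_price(ticker='X') -> price as of one hour ago"
    return Trajectory(task, [call], succeeded=knows_delay)


def toy_editor(docs: Docs, trajectories: List[Trajectory]) -> Docs:
    if any(not t.succeeded for t in trajectories):
        note = " Observed: data has a one-hour delay."
        return {"get_price": docs["get_price"] + note}
    return {}


final = refine_docs(
    {"get_price": "Returns live prices."},
    ["Explain today's stock move"],
    toy_agent,
    toy_editor,
)
print(final["get_price"])
```

In this toy run the first round fails, the editor records the delay, and the second round produces no further edits, so the loop stops.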

We find that prior approaches either lack iterative refinement or incur prohibitive costs from extensive exploratory interactions. ToolObserver avoids both extremes: it updates documentation iteratively, but does so from the tool interactions that naturally arise while solving the task. The result is a more efficient way to turn opaque tools into usable tools.

OpaqueToolsBench: A Benchmark for the Real World

Most tool-use benchmarks evaluate agents with clean, complete, and accurate tool descriptions. That makes them useful for testing tool selection, but less useful for testing what happens in real deployments, where tools are often under-documented, outdated, or behaviorally opaque.

As a result, evaluating an agent’s ability to operate under these conditions is challenging because success depends not just on composing tools correctly, but on learning how tools behave through interaction.

To evaluate this setting, we are introducing OpaqueToolsBench, a benchmark designed around a simple question: can an agent learn how to use tools when the documentation is incomplete, misleading, or missing?

OpaqueToolsBench measures success through three complementary environments:

  1. The BFCL-Opaque environment adapts tasks from the Berkeley Function Calling Leaderboard (BFCL) [1] and removes or obscures the usual helpful documentation. The agent may need to infer what a function does, what arguments it expects, and how to call it correctly.
  2. The Chess environment gives the agent several move-suggestion tools with identical-looking interfaces, but different hidden behaviors. Unbeknownst to the agent, some tools are stronger than others, or better at playing different phases of the game. The agent has to learn which tool to trust and when.
  3. The BrowseComp Domains environment builds on the BrowseComp-Plus benchmark [2] by giving the agent multiple anonymous search tools, each specialized for different kinds of information. To answer hard questions, the agent must figure out which search tool covers which domain, how to query it, and how to combine tools across a longer trajectory.
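To give a feel for what an opaque tool looks like from the agent's side, the sketch below strips the documentation from a function-calling schema and replaces the tool name with an anonymous alias. It is only an illustration: the schema layout, the `make_opaque` helper, and the `tool_0` naming are assumptions, and the benchmark's actual construction is described in the paper.

```python
import copy
from typing import Dict, List, Tuple


def make_opaque(tools: List[dict]) -> Tuple[List[dict], Dict[str, str]]:
    """Return anonymized copies of the tools plus an alias -> real name map."""
    opaque: List[dict] = []
    name_map: Dict[str, str] = {}
    for i, tool in enumerate(tools):
        t = copy.deepcopy(tool)  # leave the original schema untouched
        alias = f"tool_{i}"
        name_map[alias] = t["name"]
        t["name"] = alias
        t["description"] = ""  # drop the tool-level description
        properties = t.get("parameters", {}).get("properties", {})
        for param in properties.values():
            param.pop("description", None)  # drop per-argument hints
        opaque.append(t)
    return opaque, name_map


# Hypothetical, fully documented tool schema.
tools = [
    {
        "name": "get_revenue",
        "description": "Return reported GAAP revenue for a ticker.",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker symbol."}
            },
            "required": ["ticker"],
        },
    }
]

opaque, name_map = make_opaque(tools)
print(opaque[0])
```

The agent sees only the alias and the argument types, so it has to recover the tool's purpose and calling convention from the results of its own calls.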

Together, these environments test whether agents can move beyond static tool invocation and instead adaptively learn to use imperfectly documented tools. For full dataset and evaluation details, we refer readers to the paper.

On OpaqueToolsBench, ToolObserver outperforms state-of-the-art methods by 18.6% on average, while using 3.5–7.5× fewer tokens during inference.

Interestingly, the strongest gains come in the most underspecified settings, where the agent starts with little more than anonymous tool names and has to recover both what each tool does and how to call it. This suggests that tool behavior can be learned operationally, reducing the need to specify every nuance perfectly in advance.

Figure 3: ToolObserver delivers the strongest results across all three OpaqueToolsBench settings, consistently outperforming existing methods.

The Best Agents Learn From Their Tools

Tool documentation should not be treated as a static artifact. As agentic systems move into high-stakes workflows, documentation becomes part of the runtime: something agents read, test, revise, and rely on to make decisions.

ToolObserver points toward that future. Instead of requiring every tool to be perfectly specified upfront, it lets agents build operational knowledge through experience. For financial research, this means agents that are not just better at calling tools, but better at understanding when a tool should be trusted, questioned, or avoided altogether.

Acknowledgements

Skyler Hallinan led this research work during his internship at Samaya AI, under the mentorship of Thejas Venkatesh, Ashwin Paranjape, Yuhao Zhang and Jack Hessel, together with guidance and collaboration from Prof. Xiang Ren and Prof. Sai Praneeth Karimireddy from the University of Southern California.

Special thanks to members of the Samaya AI team, including Bram Mulders, Kyle Chang and Paddu Raghavan, for their support on infrastructure setup and design improvements.

References

  1. Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley function calling leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.
  2. Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, and Jimmy Lin. BrowseComp-Plus: A more fair and transparent evaluation benchmark of deep-research agent. 2025. URL https://api.semanticscholar.org/CorpusID:280565737.