At Samaya AI, we build agents for financial workflows that harness a wide range of tools to produce precise, analyst-grade research. In the process, we found a recurring problem: capturing every behavioral nuance of a tool is difficult. This problem compounds as new tools are added and the guidance for combining them requires constant maintenance. Even small gaps in documentation can cause agents to misuse tools, waste tokens, or produce lower-quality answers. Yet today’s benchmarks largely fail to capture this failure mode.
Samaya's ML team, together with Skyler Hallinan, Prof. Xiang Ren and Prof. Sai Praneeth Karimireddy from the University of Southern California, addresses this gap with ToolObserver, a method that guides the model to interact with the tools themselves and iteratively refine its understanding of tool behavior. We're also introducing OpaqueToolsBench, a benchmark designed to evaluate agents in settings where tool documentation is incomplete or imperfect.
On OpaqueToolsBench, ToolObserver outperforms state-of-the-art methods by 18.6% on average, while using 3.5–7.5× fewer tokens during inference.
Our work has been accepted to the ACM Conference on AI and Agentic Systems (ACM CAIS 2026), where it will be presented. The paper, code and dataset are now publicly available: [paper] [code & data].
Consider a query a financial analyst might ask: “Why did Company X’s stock move after earnings, and is the reaction justified by the underlying fundamentals?”
Answering this well requires an agent to sift through filings, transcripts, market data, search results, and internal knowledge sources using a broad set of tools. Each tool may appear straightforward in isolation, but in practice each carries assumptions and edge cases that can materially change the answer.
When these behaviors are missing from the documentation, the agent can appear to follow the right research process while making the wrong analytical judgment. It may compare incompatible revenue figures, miss the passage that explains margin pressure, treat an empty result as evidence of absence, or explain a stock move using stale market data.
This is not unique to finance. In almost all real-world scenarios, agents depend on tools that behave in ways no specification can fully capture. The gap may come from hidden assumptions, edge cases, changing systems, or interactions between tools. When agents cannot learn these nuances, they may execute the right steps and still reach the wrong answer.
To address these issues, we propose ToolObserver, an approach that starts from a simple premise: when tool documentation is incomplete, agents should learn from the tool itself.
ToolObserver learns from the agent’s actual task trajectories. It observes the sequence of reasoning steps, tool calls, outputs, errors, and final outcomes, then uses an editor model to refine the tool descriptions. The loop is straightforward:
We find that prior approaches either lack iterative refinement or incur prohibitive costs from extensive exploratory interactions. ToolObserver avoids both extremes: it updates documentation iteratively, but does so from the tool interactions that naturally arise while solving the task. The result is a more efficient way to turn opaque tools into usable tools.
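The refinement loop described above can be illustrated with a minimal sketch. This is not the ToolObserver implementation: the names `Trajectory`, `refine_descriptions`, `toy_agent`, and `toy_editor` are hypothetical, the `get_price` tool is invented for illustration, and both the agent and the editor model are replaced by simple stand-in functions. It only shows the shape of the idea, where descriptions are revised from trajectories that arise while solving tasks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """One task attempt: what the agent did and whether it succeeded."""
    task: str
    steps: List[str]  # tool calls, outputs, and errors, in order
    success: bool


def refine_descriptions(tasks, descriptions, run_agent, edit, rounds=3):
    """Alternate between solving tasks and revising tool descriptions.

    run_agent(task, descriptions) -> Trajectory        (assumed interface)
    edit(descriptions, trajectories) -> descriptions   (assumed interface)
    """
    current = dict(descriptions)
    for _ in range(rounds):
        # Collect trajectories from ordinary task solving, not extra exploration.
        trajectories = [run_agent(task, current) for task in tasks]
        # Let the editor revise the descriptions based on what was observed.
        current = edit(current, trajectories)
    return current


def toy_agent(task, descriptions):
    # Stand-in for an LLM agent: it upper-cases the ticker only if the docs say so.
    ticker = task.upper() if "uppercase" in descriptions["get_price"] else task
    if ticker.isupper():
        return Trajectory(task, [f"get_price({ticker!r}) -> ok"], True)
    error = f"get_price({ticker!r}) -> error: ticker must be uppercase"
    return Trajectory(task, [error], False)


def toy_editor(descriptions, trajectories):
    # Stand-in for the editor model: it copies observed error hints into the docs.
    revised = dict(descriptions)
    for traj in trajectories:
        for step in traj.steps:
            if "error:" in step:
                hint = step.split("error:", 1)[1].strip()
                if hint not in revised["get_price"]:
                    revised["get_price"] += f" Note: {hint}."
    return revised


docs = refine_descriptions(
    ["aapl", "msft"],
    {"get_price": "Returns a price."},
    toy_agent,
    toy_editor,
    rounds=2,
)
print(docs["get_price"])
```

In this toy setup the first round fails on both tasks, the editor records the error hint in the description, and the second round succeeds because the agent now reads the revised description.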
Most tool-use benchmarks evaluate agents with clean, complete, and accurate tool descriptions. That makes them useful for testing tool selection, but less useful for testing what happens in real deployments, where tools are often under-documented, outdated, or behaviorally opaque.
Evaluating an agent’s ability to operate under these conditions is challenging, because success depends not just on composing tools correctly but also on learning how tools behave through interaction.
To evaluate this setting, we are introducing OpaqueToolsBench, a benchmark designed around a simple question: can an agent learn how to use tools when the documentation is incomplete, misleading, or missing?
OpaqueToolsBench measures success through three complementary environments:
Together, these environments test whether agents can move beyond static tool invocation and instead adaptively learn to use imperfectly documented tools. For full dataset and evaluation details, we refer readers to the paper.
On OpaqueToolsBench, ToolObserver outperforms state-of-the-art methods by 18.6% on average, while using 3.5–7.5× fewer tokens during inference.
Interestingly, the strongest gains come in the most underspecified settings, where the agent starts with little more than anonymous tool names and has to recover both what each tool does and how to call it. This suggests that tool behavior can be learned operationally, reducing the need to specify every nuance perfectly in advance.
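To make the most underspecified setting concrete, here is a small sketch of what "anonymous tool names" might look like from the agent's side. It is an assumption for illustration only, not how OpaqueToolsBench constructs its environments: `ToolSpec`, `make_opaque`, and the example tools `get_filing` and `search_transcripts` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> parameter description


def make_opaque(tools: List[ToolSpec]) -> List[ToolSpec]:
    """Replace names with anonymous ids and blank out all descriptions."""
    opaque = []
    for i, tool in enumerate(tools):
        opaque.append(
            ToolSpec(
                name=f"tool_{i}",
                description="",
                # Keep only the number of parameters; hide their names and meaning.
                parameters={f"arg_{j}": "" for j, _ in enumerate(tool.parameters)},
            )
        )
    return opaque


documented = [
    ToolSpec(
        "get_filing",
        "Fetch a filing by ticker and fiscal year.",
        {"ticker": "Ticker symbol", "year": "Fiscal year"},
    ),
    ToolSpec(
        "search_transcripts",
        "Search earnings-call transcripts.",
        {"query": "Search text"},
    ),
]

opaque = make_opaque(documented)
for tool in opaque:
    print(tool.name, list(tool.parameters))
```

An agent given only the `opaque` list has to recover, through interaction, both what each tool does and which arguments it expects.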
Tool documentation should not be treated as a static artifact. As agentic systems move into high-stakes workflows, documentation becomes part of the runtime: something agents read, test, revise, and rely on to make decisions.
ToolObserver points toward that future. Instead of requiring every tool to be perfectly specified upfront, it lets agents build operational knowledge through experience. For financial research, this means agents that are not just better at calling tools, but better at understanding when a tool should be trusted, questioned, or avoided altogether.
Skyler Hallinan led this research work during his internship at Samaya AI, under the mentorship of Thejas Venkatesh, Ashwin Paranjape, Yuhao Zhang and Jack Hessel, together with guidance and collaboration from Prof. Xiang Ren and Prof. Sai Praneeth Karimireddy from the University of Southern California.
Special thanks to members of the Samaya AI team, including Bram Mulders, Kyle Chang and Paddu Raghavan, for their support on infrastructure setup and design improvements.