Evaluation of AI Agents at Samaya

Jack Hessel

Samaya is developing AI agents that dynamically compose models (e.g., retrievers, named entity disambiguators) to tackle challenging knowledge tasks that exceed the capabilities of any single component. Beyond traditional evaluations — which typically assess components individually — we are creating a set of evaluation environments designed to measure agent performance in realistic, holistic, and ambitious scenarios.

In this blog post, we’ll highlight one evaluation environment that we developed: RealityBench. We’ll see that:

  1. Samaya-QA, our real-time question-answering agent that itself orchestrates a composition of models, outperforms baselines like Grounded RAG + Claude Sonnet; and
  2. Samaya-QAx16, an agent that scales compute by 16x compared to Samaya-QA, performs even better, exhibiting promising scaling properties.

RealityBench: Can Models Reason About the Future?

Users of Samaya often conduct research with the goal of forecasting key performance indicators (for companies, for economies, etc.). As a proxy, we built RealityBench: an evaluation that assesses whether our system can itself produce accurate forecasts.

Figure 1: RealityBench challenges models to make binary forecasts about the future based on data from the past. The schematic shows an instance asking whether more than 15K jobs will be added, with Samaya making a temporal forecast one week into the future.

RealityBench consists of 972 timestamped numerical announcements, for example:

“On February 7, 2025, the Bureau of Labor Statistics (BLS) reported that employment in the ‘computer systems design and related services’ sector increased by 13.7K jobs.”

We convert announcements into prediction instances by applying a small amount of random noise to the ground-truth quantity (e.g., 13.7K -> 15K), and asking agents to make a binary prediction about the value, i.e.,

“True or False: the February 7, 2025 BLS report will report more than +15K jobs in the ‘computer systems design and related services’ sector” (False)

To test forecasting capacity, agents are given access only to documents/information published at least one week before the announcement; in our running example, that is all documents available before February 1st, 2025. Binary accuracy is the final metric: random guessing achieves 50%.
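For concreteness, here is a minimal sketch of how an instance like the one above could be constructed and scored. The schema, the uniform noise scheme, and the helper names are illustrative assumptions, not the exact RealityBench implementation.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Announcement:
    """A timestamped numerical announcement (illustrative schema)."""
    announce_date: date
    description: str    # e.g., "jobs in the 'computer systems design and related services' sector"
    true_value: float   # e.g., 13.7 (thousands of jobs)


def make_instance(ann: Announcement, rel_noise: float = 0.1, seed: int = 0):
    """Perturb the true quantity into a threshold and phrase a binary question.

    The uniform relative-noise scheme and the one-week cutoff below are assumptions;
    RealityBench applies "a small amount of random noise" to the ground-truth value
    and restricts agents to documents published before the cutoff.
    """
    rng = random.Random(seed)
    threshold = ann.true_value * (1.0 + rng.uniform(-rel_noise, rel_noise))
    question = (
        f"True or False: the {ann.announce_date:%B %d, %Y} report will report "
        f"more than +{threshold:.1f}K {ann.description}"
    )
    label = ann.true_value > threshold                   # ground-truth binary answer
    cutoff = ann.announce_date - timedelta(weeks=1)      # agents only see documents before this date
    return question, label, cutoff


def binary_accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Final RealityBench metric; random guessing scores ~0.5."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)
```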

Results

We compare the performance of:

  1. a zero-shot application of claude-sonnet-3.5 v2 [1];
  2. Grounded RAG [2];
  3. Samaya-QA, our real-time question-answering offering; and
  4. Samaya-QAx16: an agent that is given access to Samaya-QA as a tool, with a budget of 16 queries (a minimal sketch of this kind of budgeted loop appears below).
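To make the fourth system concrete, the sketch below shows a budgeted tool-use loop: an orchestrating model that may issue up to 16 Samaya-QA queries before committing to a True/False answer. The `samaya_qa` and `orchestrator` callables are hypothetical stand-ins for illustration, not the production interfaces.

```python
from typing import Callable


def run_budgeted_agent(
    question: str,
    samaya_qa: Callable[[str], str],            # hypothetical: one Samaya-QA query -> answer text
    orchestrator: Callable[[str, list], dict],  # hypothetical: plans the next query or a final answer
    budget: int = 16,                           # Samaya-QAx16 uses a budget of 16 queries
) -> bool:
    """Answer a binary forecasting question using at most `budget` Samaya-QA calls."""
    evidence: list[tuple[str, str]] = []
    for _ in range(budget):
        step = orchestrator(question, evidence)
        if step["action"] == "final":           # the orchestrator is ready to commit
            return bool(step["answer"])
        sub_query = step["query"]               # e.g., "recent hiring trends in IT services"
        evidence.append((sub_query, samaya_qa(sub_query)))
    # Budget exhausted: ask for a final answer given whatever evidence was gathered.
    return bool(orchestrator(question, evidence).get("answer", False))
```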
Figure 2: RealityBench results. Accuracy by system: constant baseline, 51%; Grounded RAG, 52%; Claude-Sonnet-3.5 v2, 51%; Samaya-QA, 62%; Samaya-QAx16, 67%.

Overall, the results demonstrate that:

  1. models can meaningfully reason about the future given data from the past;
  2. Samaya-QA outperforms baselines; and
  3. increasing the compute budget (i.e., moving from Samaya-QA -> Samaya-QAx16) yields further gains, exhibiting promising scaling properties.

We qualitatively examine the reasoning traces of Samaya-QAx16 and find them interpretable. For claims like the BLS example above, the agent makes a calibrated prediction about the employment statistic: it organizes its process into a “bear” case and a “bull” case, researches recent industry trends and conditions, investigates announcements, updates, and commentary from major players, and finally integrates its findings into a succinct, well-formatted summary.

Samaya-QAx16 is one of several agents we offer on the Samaya platform today!

Scaling Agents beyond RealityBench

We have observed promising scaling properties of Samaya-QAx16 on evaluation environments beyond RealityBench as well. In a future blog post, we will discuss the details of CriteriaEval, which measures domain-specific question answering and search performance by leveraging a set of expert-curated annotations. Here’s a teaser, illustrating similar scaling trends (higher=better).

Figure 3: Scaling trends on CriteriaEval. Scores by system: Grounded RAG, 13.8; Claude-Sonnet-3.5 v2, 14.1; Samaya-QA, 24.8; Samaya-QAx16, 30.7.

Stay tuned for more! 😎

Contributors

Jack led the development of RealityBench, ran the experiments, and wrote this blog with feedback from Ashwin Paranjape and Yuhao Zhang.

Varun Patel and Jack worked together to bring Samaya-QAx16 into production.

Special thanks to Yuhao Zhang, Thejas Venkatesh, Christos Baziotis, Alexander Baranov, Ashwin Paranjape, Maithra Raghu, Michele Bevilaqua, Sachin Raj, Suharsh Sivakumar, Fabio Petroni, and Roberto Dessi for their valuable discussions and feedback.

Footnotes

[1] We ask claude-sonnet-3.5 v2 to make a prediction based on its commonsense understanding of the world and its knowledge of past events. All RealityBench instances were mined after the training-data cutoff of every component model in every system, including claude-sonnet-3.5 v2, so it is unlikely that models have simply memorized these facts.

[2] The Grounded RAG baseline is: an e5-base retriever, a proprietary Samaya reranker, and claude-sonnet-3.5 v2, which generates an answer to the question given the retrieved results. The final prompt is “grounded” in that it asks the model to cite claims/facts from the given search results.
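For readers who want the shape of this baseline in code, here is a minimal sketch that assumes generic `retrieve`, `rerank`, and `generate` callables standing in for the e5-base retriever, the Samaya reranker, and claude-sonnet-3.5 v2; the prompt wording and top-k values are illustrative, not the production pipeline.

```python
from typing import Callable, Sequence


def grounded_rag_answer(
    question: str,
    corpus: Sequence[str],
    retrieve: Callable[..., list[str]],   # stand-in for the e5-base dense retriever
    rerank: Callable[..., list[str]],     # stand-in for the proprietary Samaya reranker
    generate: Callable[[str], str],       # stand-in for claude-sonnet-3.5 v2
    k: int = 50,                          # candidates from retrieval (illustrative)
    n: int = 10,                          # passages kept after reranking (illustrative)
) -> str:
    """Retrieve -> rerank -> generate, with a prompt that asks for cited claims."""
    candidates = retrieve(question, corpus, top_k=k)
    passages = rerank(question, candidates)[:n]
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the numbered search results below, and cite "
        "the bracketed result numbers that support each claim.\n\n"
        f"Search results:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```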