Criteria-Eval: Evaluating Long-form Answers to Complex Questions

Christos Baziotis

After saturating the component-wise evaluations of our question-answering system, we built a challenging end-to-end evaluation benchmark that has helped us double our system quality over the last nine months. We are excited to share our evaluation recipe with the community.

We’re entering the era of AI agents. These systems do not just retrieve isolated facts or generate brief summaries. They synthesize multiple sources of complex information to answer challenging, multi-dimensional questions. At Samaya, our AI agents routinely face such demanding queries from financial analysts. For example:

“Summarize NVIDIA’s latest earnings call, compare key financial metrics against previous forecasts, and provide a detailed SWOT analysis based on the company’s strategic outlook.”

These questions require long, detailed answers, spanning multiple paragraphs or even pages. Providing a good answer requires coordination across several specialized components. Evaluating the quality of these complex outputs is challenging because traditional component-level metrics miss crucial interactions between components.

In this blog post, we introduce Criteria-Eval, our internal evaluation framework specifically designed to measure the end-to-end quality of complex, long-form answers. Instead of relying on reference answers or generic rubrics, Criteria-Eval uses objective, yes-or-no checklists written by domain experts to assess factual accuracy, completeness, formatting, and more.

This approach makes evaluation precise, flexible, and, most importantly, directly aligns with how expert users judge quality.

Figure 1: Domain experts first research authoritative sources to answer a complex, multi-part financial query. They then produce a binary checklist of factual and formatting criteria that any correct answer must meet. The system’s response is evaluated against this checklist using an LLM-as-a-Judge setup, with each criterion scored independently. The final score reflects the proportion of checklist items satisfied, capturing the end-to-end quality of the answer.

Why End-to-End Evaluation Matters

When building complex systems, component-level evaluation is a natural starting point. We began there, measuring our retrievers, entity disambiguators, query orchestrators, etc., with targeted metrics. While useful individually, these metrics fail to account for critical interactions and error propagation between components.

What we ultimately care about is end-to-end performance:

Did the user receive the accurate, complete, and actionable answer they needed?

Component-level metrics simply can’t capture this.

Early on, we heavily relied on side-by-side product testing sessions and user feedback to judge quality end-to-end. It was manual, slow, and reactive. Crucially, it made catching regressions and making confident deployment decisions incredibly difficult. This was a gap we needed to fill.

Criteria-Eval: A Checklist-based Evaluation

Criteria-Eval addresses these shortcomings by shifting the evaluation mindset from reference generation to criteria specification:

We use domain experts (financial professionals) to define the criteria that a good answer should meet rather than asking them to write reference answers.

This checklist-based approach (Ribeiro et al., 2020) enables us to evaluate retrieval quality and generation accuracy simultaneously.

Each criterion is a simple, clear assertion that must explicitly appear in the final answer (e.g., “Apple’s fiscal 2023 revenue was $383.3 billion”). This approach has several advantages:

  • Objective: Allows scoring each criterion with a straightforward yes/no decision. This makes evaluation objective and highly reproducible by human annotators or automated evaluators.
  • Flexible: Checklists also allow multiple valid ways to answer the same query, as long as the critical criteria are met.[1]
  • Instruction-aware: They also allow measuring instruction following (e.g., formatting requests) in addition to factual accuracy and completeness.
  • Expert-aligned: The criteria directly encode the expectations and knowledge of domain experts.
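
To make the structure concrete, here is a minimal sketch, in Python, of how a checklist entry and an annotated query could be represented. The field names and the example query date are our own illustration, not Samaya's actual schema; the revenue figure is the one quoted above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Criterion:
    text: str                 # full-sentence assertion starting with an action verb
    category: str             # e.g., "Methodology", "Earnings Call Transcript", "Format"
    sources: List[str] = field(default_factory=list)  # at least one data-source tag

@dataclass
class AnnotatedQuery:
    query: str
    query_date: str           # cited data must be public on or before this date
    checklist: List[Criterion]

# Hypothetical example, reusing the Apple figure mentioned above.
example = AnnotatedQuery(
    query="What was Apple's fiscal 2023 revenue?",
    query_date="2024-01-31",
    checklist=[
        Criterion(
            text="Provides that Apple's fiscal 2023 revenue was USD 383.3 billion.",
            category="Earnings Release",
            sources=["SEC 10-K"],
        ),
    ],
)
```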

Dataset Overview

We started collecting data for Criteria-Eval approximately a year ago, and we’ve continuously expanded our dataset since then. As of now, the dataset includes:

  • 3000 queries, reflecting a diverse spectrum of realistic financial analyst questions. These questions cover the majority of use cases encountered by our users at Samaya, ranging from straightforward factual queries to complex strategic analyses.
  • Each query has one or multiple independent annotations by different experts, capturing a range of valid responses and perspectives.
  • Queries vary significantly in complexity, typically comprising from 5 up to 40 or more criteria, with an average of around 20 criteria per query.

This dataset represents a substantial investment of expertise and time, with our financial domain experts dedicating more than 8000 hours to meticulous annotation work.

Annotation Process

Annotators adopt the role of a financial research analyst asked to answer a query. Their deliverable is an exhaustive bullet-point checklist with the facts, analyses, and formatting cues a perfect two-page report would need.

The process is governed by comprehensive internal guidelines.[2] The workflow looks like this:

  1. Fix the time: set a “query date”; all cited data must have been publicly available on or before that date, preventing hindsight leakage.[3]
  2. Lock the scope: draft one or more Methodology criteria that lock in coverage choices: time window, peer set, sampling logic, presentation format, etc. This makes every plan reproducible.
  3. Research: consult only authoritative sources (SEC filings, earnings transcripts, market data, regulatory releases, industry reports). AI tools, including Samaya, are explicitly banned.[4]
  4. Write the checklist: for each insight, write a full-sentence bullet that is specific, atomic, and begins with an action verb (“Provides…”, “Includes…”). Each bullet is tagged with at least one data source.
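
As an illustration of the kind of constraints the annotation manual enforces (action-verb bullets, source tags, no data newer than the query date), here is a hypothetical guideline check. The function and its inputs are our own assumptions, not part of Samaya's tooling.

```python
from datetime import date

# Action verbs that guideline-compliant bullets start with (illustrative subset).
ACTION_VERBS = ("Provides", "Includes", "States", "Presents")

def lint_criterion(text: str, sources: list[str],
                   source_dates: list[date], query_date: date) -> list[str]:
    """Return a list of guideline violations for a single checklist bullet."""
    issues = []
    if not text.strip().startswith(ACTION_VERBS):
        issues.append("bullet should begin with an action verb, e.g., 'Provides…'")
    if not sources:
        issues.append("bullet must be tagged with at least one data source")
    if any(d > query_date for d in source_dates):
        issues.append("cited data must be publicly available on or before the query date")
    return issues
```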

Taxonomy of Questions

We have collected a wide range of questions financial analysts actually ask. The table below shows the most common types of questions.

Category | Typical Example | What the checklist will demand
Point-fact retrieval | “What was Exxon’s Q2 2024 revenue?” | The exact figure, source doc, currency code, time reference.
Time-series & deltas | “How have Spotify premium subs trended over the past year?” | Each quarterly datapoint, growth rates, chart instruction.
Event digests | “Summarize today’s Google earnings call.” | Call logistics, headline metrics vs. consensus, CEO sound-bites, key themes.
Forward guidance & comparisons | “What segment guidance did Microsoft give last call, and how does it differ from prior outlook?” | New vs. old numbers, context notes, table formatting.
Thematic extraction | “Which companies mention consumers trading down?” | List of companies, supporting quotes, classification rationale.
Strategic analysis | “Give me a SWOT for NVIDIA.” | Clearly separated strengths, weaknesses, etc., each with supporting evidence.
Other | |
Table 1: Taxonomy of the most common types of questions in our dataset, along with a typical example and what the checklist will demand.

Scoring Method

Once the checklist is written, Criteria-Eval uses LLM-based judges to evaluate system responses. Each answer is scored against its checklist, one criterion at a time.

For every criterion, the LLM judge reads the system’s output and returns a binary judgment:

  • Pass: if the answer fully satisfies the criterion.
  • Fail: if the fact, analysis, or instruction is missing, incorrect, ambiguous, or incomplete.

The final score is simply the percentage of criteria met.
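
Below is a minimal sketch of this scoring loop, with `call_llm` standing in for whatever LLM judge is used; the prompt wording is our own illustration, not Samaya's actual judge prompt.

```python
def judge_criterion(answer: str, criterion: str, call_llm) -> bool:
    """Ask an LLM judge whether the answer satisfies a single criterion."""
    prompt = (
        "You are grading an answer against a single criterion.\n"
        f"Criterion: {criterion}\n"
        f"Answer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    return call_llm(prompt).strip().upper() == "PASS"

def criteria_eval_score(answer: str, checklist: list[str], call_llm) -> float:
    """Fraction of checklist criteria the answer satisfies."""
    verdicts = [judge_criterion(answer, c, call_llm) for c in checklist]
    return sum(verdicts) / len(verdicts)
```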

Criteria-Eval assigns high scores if and only if the system both retrieves and faithfully presents all these critical facts. Missing criteria immediately indicate specific retrieval or generation gaps.

Built to Resist Saturation

Criteria-Eval includes many open-ended, research-style questions such as:

  • “What’s new with Company X?” or
  • “Tell me some interesting developments in [sector]”

While each query typically contains a set of core criteria that are critical (but easier) to address, there is also a deliberate, extensive tail of supplementary criteria. This is by design: by asking annotators to provide exhaustive, detailed criteria, we prevent Criteria-Eval from saturating.

We measured the overlap between the sets of criteria that different annotators generate for the same question and found that, on average, it is roughly 29%. This reflects the fact that there are multiple valid answers to the same question and provides a benchmark for our systems to reach human-level performance on this ambiguous task.[5]
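
To sketch how this overlap is estimated (the cross-evaluation described in footnote 5): an answer written to satisfy one annotator's full checklist is scored against another annotator's checklist for the same query, and scores are averaged across annotator pairs. The function below is our own schematic rendering, assuming the `criteria_eval_score` judge sketched earlier and a hypothetical `write_perfect_answer` step.

```python
from itertools import permutations
from statistics import mean

def annotator_overlap(checklists_by_query, write_perfect_answer, criteria_eval_score):
    """checklists_by_query: {query: {annotator_id: [criterion, ...]}}."""
    scores = []
    for query, by_annotator in checklists_by_query.items():
        for a, b in permutations(by_annotator, 2):
            # An answer written to satisfy all of annotator a's criteria...
            answer_a = write_perfect_answer(query, by_annotator[a])
            # ...scored against annotator b's checklist for the same query.
            scores.append(criteria_eval_score(answer_a, by_annotator[b]))
    return mean(scores)  # roughly 0.29 on average in our data
```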

In a landscape where benchmarks are quickly saturated, Criteria-Eval remains demanding and provides meaningful signal for improvement.

Examples

Below are four representative examples of how Criteria-Eval is applied to real-world financial questions. Each example illustrates a specific type of analyst query, along with the corresponding expert-generated checklist criteria used to assess answer quality.

Event digest challenge: Extract and organize key financial information from a specific corporate event, requiring both accurate retrieval and thematic organization.

Criterion | Category | Notes
States that the Q2 2024 earnings calls transcript of CVS Health Corporation (CVS) has been analyzed and considered. | Methodology | Establishes the data source and time period for analysis.
Provides that CVS Health Corporation (CVS) Q2 2024, reported adjusted earnings per share of USD 1.83. | Earnings Call Transcript | Core financial metric that must be included in any earnings summary.
Provides that CVS Health Corporation (CVS) Q2 2024, reported total revenues of more than USD 91.00 billion. | Earnings Call Transcript | Key revenue figure that provides scale context for the business.
Presents the above information by making the points in bold format. | Format | Instruction-following criterion. Answer must use bold formatting as specified.
Presents the above information by grouping the points into themes where all the points with figures are categorized as Financials, while points with future prediction as Guidance and remaining as Business Overview. | Format | Tests ability to organize information thematically for better readability.
… additional criteria …

Strategic analysis challenge: Synthesize information across multiple sources to identify corporate strategy rationales not explicitly stated in any single document.

Criterion | Category | Notes
Provides information and analysis for Meta Platforms Inc’s (META) strategy to open source their LLMs for free from Earnings Call Transcripts, Earnings Releases, Press Releases, Industry News Websites and Financial News Websites. | Methodology | Establishes the scope and sources for a comprehensive analysis.
Present all the information and analysis for Meta Platforms Inc’s (META) strategy to open source their LLMs for free in bullet points. | Format | Instruction-following criterion. Requires specific formatting.
Provides Meta Platform Inc’s comments on Llama 3 mentioning that the company is embracing the open source ethos of releasing early and often to enable the community to get access to these models while they are still in development. | Press Release | Direct statement from the company about their strategy.
Provides analysis that open-sourcing LLMs helps Meta Platforms Inc (META) compete with other big tech players. By providing free access to their LLMs, Meta aims to compete with other tech giants like OpenAI, Google, and Microsoft, positioning themselves as a leader in AI research and development while differentiating from more closed-off models. | Professional Analyses | Strategic context that explains competitive positioning.
… additional criteria …

Thematic extraction challenge: Identify a specific business theme across multiple companies, requiring both broad retrieval and consistent pattern extraction from diverse sources.

Criterion | Category | Notes
Provides an analysis of the top 10 companies in the S&P 500 by market capitalization regarding their situation with supply constraints. | Methodology | Defines scope and approach for a manageable analysis.
States that the top 10 companies in the S&P 500 by market capitalization are Apple (AAPL), Microsoft (MSFT), Nvidia (NVDA), Amazon (AMZN), Meta Platforms (META), Alphabet Inc (GOOG), Berkshire Hathaway (BRK-B), Broadcom (AVGO), Eli Lilly and Company (LLY), and JPMorgan Chase & Co (JPM). | Stock Price Data | Establishes the specific companies to be analyzed.
States that Nvidia (NVDA) had a higher demand for its products than it has capacity for, based on the earning call for Q1 2025 on May 22, 2024, inferring missed revenues. | Earnings Call Transcript | Example of supply constraint with business impact.
States that Eli Lilly and Company (LLY) has seen a decrease in their Trulicity revenue by 31% due to supply chain constraints in the earning call for Q2 2024 on August 8, 2024. | Earnings Call Transcript | Quantifies specific revenue impact of supply constraints.
Provides the information gathered in a table format, including the company name, ticker, the latest quarter comment, and the previous quarter comment. | Format | Requires structured presentation for easy comparison.
… additional criteria …

Precise comparison challenge: Retrieve specific financial metrics and compare them to market expectations, requiring both factual accuracy and structured presentation of complex numerical data.

Criterion | Category | Notes
Provides analysis of Uber Technologies (UBER) revenue and earnings per share last quarter relative to what the consensus revenue and earnings per share estimates were for last quarter, and presents the data in table format. | Methodology | Defines the scope and presentation format for the analysis.
Provides revenue for Uber Technologies (UBER) in Q2 2024 as USD 10.70 billion. | Earnings Release | Core revenue figure from official company reporting.
Provides the S&P Capital IQ consensus revenue estimate for Uber Technologies (UBER) in Q2 2024 as USD 10.59 billion. | Consensus Data | Market expectation benchmark for comparison.
Provides earnings per share for Uber Technologies (UBER) in Q2 2024 as USD 0.47. | Earnings Release | Core profitability metric from official company reporting.
Presents Q2 2024 revenue and earnings per share data for Uber Technologies (UBER) in a table that includes data values, year-over-year growth percentage, consensus estimates, amount by which the data values differed from consensus estimates, and the count of consecutive quarterly outperformance. | Format | Requires comprehensive tabular presentation with multiple data points.
… additional criteria …

Analysis: Impact of Query Complexity

To understand how answer quality scales with increasing question complexity, we compared two distinct retrieval strategies using Criteria-Eval. Both systems use the same LLM to synthesize final answers, but differ in how they retrieve supporting evidence.

  • RAG: A conventional (naive) RAG baseline that embeds queries with a dense retrieval model and uses a state-of-the-art LLM to generate the answer.
  • Samaya: Samaya’s specialized Q&A pipeline that leverages multiple expert models for query understanding, retrieval, reasoning and synthesis.

We grouped queries by the number of criteria per query, which serves as a proxy for question complexity. Complexity varied from simpler queries (~5 criteria) to highly intricate ones (~35+ criteria).

More criteria generally signal a more demanding, multi-faceted question.

Figure 2: Criteria-Eval accuracy across queries grouped by number of checklist criteria. While both systems perform similarly on simple queries (5–10 criteria), the RAG baseline degrades sharply as complexity grows. Samaya maintains stable performance across all difficulty levels. The dashed line, at roughly 29%, represents the average criteria overlap between human annotators on the same query, providing a useful human reference point.

To help contextualize these results, the dashed line marks the average overlap between the set of criteria generated by different annotators for the same query. This provides a useful human reference point for interpreting system performance (see the Built to Resist Saturation section for more details). The results are shown in Figure 2, and we observe the following:

  • Both RAG and Samaya start at similar accuracy levels for simpler queries.
  • However, as complexity increases, a clear pattern emerges:
    • The RAG baseline steadily declines and struggles to surface the full set of relevant facts for the most complex queries, dropping to about 10% accuracy.
    • Samaya maintains stable performance, even for queries with more than 35 criteria.

These results clearly demonstrate the benefits of a specialized multi-component retrieval and orchestration pipeline, especially as query complexity increases and conventional RAG approaches fall short. Criteria-Eval inherently reveals these differences by naturally encoding query complexity through its criteria-driven scoring.
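
For readers who want to reproduce this kind of breakdown on their own data, here is a minimal sketch of the bucketing: per-query Criteria-Eval scores are grouped by checklist size and averaged per system. The record format and bucket edges are illustrative assumptions, not the exact bins used above.

```python
from collections import defaultdict
from statistics import mean

def accuracy_by_complexity(results, edges=(10, 20, 30)):
    """results: iterable of {'system': str, 'num_criteria': int, 'score': float}."""
    buckets = defaultdict(list)
    for r in results:
        # Bucket index: 0 = simplest queries, len(edges) = most complex.
        bucket = sum(r["num_criteria"] > e for e in edges)
        buckets[(r["system"], bucket)].append(r["score"])
    # Average Criteria-Eval score per (system, complexity bucket).
    return {key: mean(scores) for key, scores in buckets.items()}
```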

Why Existing Metrics Weren’t Enough

Current generative evaluation paradigms primarily rely on LLM-as-a-Judge (Zheng et al., 2023), where an auxiliary LLM scores responses against predefined or LLM-generated rubrics (e.g., G-Eval; Liu et al., 2023). While suitable for many general-purpose generation tasks, such as summarization, these rubrics fall short for knowledge-intensive tasks, where quality depends equally on retrieving and faithfully presenting key facts.

Even specialized metrics like FactScore (Min et al., 2023) fall short for this setting. FactScore evaluates faithfulness by extracting all claims from a response and verifying them against the retrieved context. While useful for spotting hallucinations, it has several limitations:

  • Closed-world assumption: FactScore assumes the retrieved context is complete and correct. If a crucial fact was never retrieved in the first place, it cannot detect its absence. A response can be 100% faithful to the context, yet still be factually wrong or incomplete.
  • No penalty for omissions: If important facts were retrieved but never included in the response, FactScore doesn’t penalize this. It only evaluates what’s present.
  • No instruction-awareness: FactScore is designed for factual accuracy, not broader compliance with user instructions, like specific formatting requests.
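
A toy illustration of the gap (not the actual FActScore implementation, and reusing the CVS figures from the example above): an answer that is 100% faithful to its retrieved context can still miss half of the criteria an expert requires, because a crucial fact was never retrieved.

```python
# Toy sets of atomic claims.
retrieved_context = {"CVS Q2 2024 adjusted EPS was USD 1.83."}
answer_claims     = {"CVS Q2 2024 adjusted EPS was USD 1.83."}
required_criteria = {
    "CVS Q2 2024 adjusted EPS was USD 1.83.",
    "CVS Q2 2024 total revenues exceeded USD 91.00 billion.",  # never retrieved
}

# Precision-style faithfulness: every claim in the answer is supported.
faithfulness = len(answer_claims & retrieved_context) / len(answer_claims)           # 1.0
# Criteria-Eval-style coverage: half of the required criteria are missing.
criteria_coverage = len(answer_claims & required_criteria) / len(required_criteria)  # 0.5
```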

Conclusion

Criteria-Eval is now a core part of how we build, monitor, and improve our QA systems at Samaya:

  • We run automated daily Criteria-Eval runs to monitor our latest model deployments and catch regressions before users ever see them.
  • Our engineers receive immediate actionable alerts whenever significant regressions are detected, allowing quick diagnosis.
  • The ML research team leverages Criteria-Eval insights to efficiently debug complex queries and prioritize research efforts.
  • By providing clear end-to-end signals directly aligned with real user experiences, Criteria-Eval has unlocked the ability to train pipelines with reinforcement learning, helping models learn to work optimally with each other, toward our vision of building co-adapted “Lattice of Experts” systems.

At Samaya, rigorous evaluation is not just a technical necessity but a core part of how we build our products. Criteria-Eval in particular has become the cornerstone of our quality assurance strategy, enabling us to consistently meet the high standards of professional analysts. It is an active and evolving dataset, and we continuously enrich it as new user queries emerge, ensuring our systems remain relevant and closely aligned with real-world needs.

As the AI community evolves toward more sophisticated, knowledge-intensive applications, we hope Criteria-Eval serves as a valuable case study for advancing evaluation methodologies.

Contributors

Ashwin Paranjape, Christos Baziotis, Jack Hessel, and Jack Silva led the design and conceptualization of Criteria-Eval.

Jack Silva led the annotation effort, managed the team of domain experts, and wrote the annotation manual.

Mingyi Yang and Christos Baziotis contributed to the engineering and analysis effort.

Special thanks to Yuhao Zhang, Roberto Dessi, Fabio Petroni, and Michele Bevilacqua for their valuable discussions and feedback throughout the project.

Footnotes

[1] Because there can be multiple good answers, especially for open-ended finance questions, we collect multiple annotations (criteria lists) per query.

[2] Our internal annotation manual spans over 50 pages and includes examples, decision rubrics, taxonomy definitions, and edge-case resolution protocols. This ensures consistency and domain rigour across annotators.

[3] “Hindsight leakage” refers to the use of information that wasn’t available at the time of the query. In Criteria-Eval, we prevent this by setting a “query date” and ensuring all cited data is publicly available on or before that date.

[4] AI tools, including Samaya, are explicitly banned to ensure that the answers are based solely on the authoritative sources provided. This keeps the evaluation criteria independent and prevents any AI-generated errors from affecting our evaluation framework.

[5] To quantify how much two annotators agree on what constitutes a good answer, we use a cross-evaluation strategy. We start by taking one annotator’s checklist and writing a “perfect” answer that satisfies all their criteria. Then, we evaluate that answer against the checklist written by a second annotator for the same query, exactly as we would evaluate any model-generated response. The proportion of checklist items satisfied in this setup serves as a proxy for the overlap between the two sets of criteria. Averaged across many such annotator pairs, this yields an empirical overlap score of approximately 29%. This number reflects the expected common ground between experts: while many core criteria (e.g., headline figures, key guidance) are shared, there is always an inevitable long tail of more subjective criteria that vary based on an annotator’s judgment. Importantly, if our system exceeds this inter-annotator overlap percentage, it indicates that the system is providing more comprehensive answers that successfully address a broader range of valid criteria than what a typical human expert might include in a single analysis.

References

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. (2023). FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1165–1183, Singapore.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. (2020). Beyond Accuracy: Behavioral Testing of NLP Models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 747–762, Online.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.


Appendix

A1. Choosing the LLM-as-a-Judge Setup

Criteria-Eval ultimately relies on the LLM-as-a-Judge approach: an LLM decides whether each checklist criterion is satisfied by a system’s answer. But before relying on any model, we ran an internal study to evaluate how well different setups align with expert human judgments.

We created a held-out dataset from Criteria-Eval where we sampled system outputs and then manually labeled them against their criteria. This gave us ground-truth judgments which we used to compare different scoring strategies:

  • Multi-LLM voting: Three different LLMs each produce a judgment, which we aggregate using majority voting.
  • Single-LLM voting: A single LLM generates multiple judgments at different temperatures, which we combine using majority voting across samples.

Both methods showed ~94% agreement with human judgments, but the multi-LLM setup aligned more reliably with human labels in edge cases.
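
As a rough sketch of the multi-LLM voting setup: each judge (a callable wrapping a different LLM, in this hypothetical interface) returns a binary verdict for a criterion, and the majority wins; an odd number of judges avoids ties. This is our own schematic, not the exact implementation.

```python
from collections import Counter
from typing import Callable, Sequence

def majority_verdict(answer: str, criterion: str,
                     judges: Sequence[Callable[[str, str], bool]]) -> bool:
    """Aggregate binary pass/fail judgments from several LLM judges by majority vote."""
    votes = Counter(judge(answer, criterion) for judge in judges)
    return votes[True] > votes[False]
```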