Deep research agents are bottlenecked by information. The quality of an AI agent’s final answer depends not only on the agent’s reasoning ability, but also on whether its retrieval system can understand and translate a natural-language query into precise, high-signal context. This initial query translation process, known as query understanding, is a vital yet often overlooked component that impacts the accuracy of agentic pipelines.

Errors at this stage can quietly propagate through the entire research workflow, leading agents to reason over incomplete or irrelevant evidence. Because query understanding sits on the critical path of nearly every production agentic workflow, a good model must be semantically accurate, cheap, and fast.

In this post, we describe how we train specialist models to better understand financial queries and use that understanding to improve downstream agentic research. Our model outperforms Claude Sonnet 4.6 and Gemini 3.5 Flash on this task while delivering lower latency and lower inference cost.

Figure: Query Understanding Pareto Frontier (task reward vs. median latency). Y-axis: query-understanding task score (higher is better). X-axis: median (p50) query latency in seconds.

The Hidden Constraints Behind Financial Retrieval

Understanding financial queries is challenging because the information needed to answer them is often implicit. Users rarely specify exact company names, dates, or reporting periods, yet these details determine what evidence should be retrieved and analyzed.

For financial queries, two semantic attributes matter most: the companies being referenced and the time window implied by the user. When neither applies, the agent should abstain rather than invent constraints that do not belong.

The table below illustrates different query-understanding scenarios, why they are hard, and the kinds of reasoning required to resolve them correctly.

| Parameter decision | Why it is hard | Example | Expected behavior |
| --- | --- | --- | --- |
| Time | Relative references must be anchored correctly; implicit earnings call timelines and product milestones; misaligned fiscal vs. calendar quarters | “What did [the automaker] say about [their flagship truck] on the last two earnings calls before the most recent one, and how does that compare to commentary from the quarter they first delivered it?” | Resolve the referenced reporting periods accurately, including anchoring on the product’s first-delivery quarter. |
| Company | Long tail of rare entities, e.g. international companies and subsidiaries; indirect references must be inferred from context (product names, geographies) | “How has demand for [flagship smartphone] affected revenue and margins over the last year?” | Infer the reporting company from the product, then retrieve earnings calls, filings, and segment commentary rather than product coverage. |
| Abstention | Requires recognizing the absence of a constraint, not extracting one; conceptual queries often name entities that aren’t constraints | “What’s the difference between NOI and FFO for REITs?” | Emit no company or time parameter, so retrieval can use broader conceptual evidence instead of company-specific or time-bounded search. |

Predicting the right company and time window is essential because these parameters determine what evidence gets retrieved. The model also has to know when to abstain, so generic queries can follow broader retrieval paths instead of being forced into company-specific or time-bounded search. This balance is difficult: over-extraction hallucinates constraints and misdirects retrieval, while over-abstention misses implicit companies, underspecified time windows, and event-based anchors. Either failure can quietly throw the agent off, leading it to reason over the wrong evidence while still producing a coherent-looking answer.
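To make the three behaviors concrete, here is a minimal sketch of what a structured query-understanding output could look like. The class and field names (`CompanyRef`, `QueryConstraints`, `time_range`) are hypothetical and are not the schema used in this work; the dates are illustrative placeholders only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CompanyRef:
    surface_form: str    # how the company appears (or is implied) in the query
    canonical_name: str  # normalized name of the listed entity


@dataclass(frozen=True)
class QueryConstraints:
    # Hypothetical schema: an empty tuple / None means "no constraint".
    companies: tuple[CompanyRef, ...] = ()
    time_range: Optional[tuple[date, date]] = None  # inclusive (start, end)

    @property
    def is_abstention(self) -> bool:
        # Abstention = emitting neither a company nor a time constraint.
        return not self.companies and self.time_range is None


# Conceptual query ("What's the difference between NOI and FFO for REITs?"):
conceptual = QueryConstraints()

# Company- and time-constrained query (the date window is a placeholder).
constrained = QueryConstraints(
    companies=(CompanyRef("Roche", "Roche Holding AG"),),
    time_range=(date(2024, 1, 1), date(2024, 3, 31)),
)
```

Representing abstention as an empty output, rather than as a separate flag, keeps the extraction and abstention decisions in a single object that retrieval can route on.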

Why Not Just Use a Frontier Model?

Large frontier models are a natural candidate for this task, since they usually have the knowledge and reasoning ability needed to accurately resolve these attributes. However, they are too expensive and slow for a component that sits on the critical path of every query.

The tempting alternative is to use a fast state-of-the-art model. These models have a surprising amount of financial domain knowledge, but often fail to apply this knowledge with the right financial semantics. In our tests, Gemini 3 Flash frequently fell short at this task.

Representative Gemini 3 Flash errors: the model knows the underlying entities and relationships, but misses finance-specific distinctions such as listed entity vs. operating company, parent vs. subsidiary, and reported quarter vs. current quarter.
| Task | Query | Correct answer | Mistake |
| --- | --- | --- | --- |
| Company Resolution | “Summarize Roche’s latest quarter results…” | Roche Holding AG (listed entity) | Resolves to F. Hoffmann-La Roche AG instead (operating subsidiary) |
| Company Resolution | “What’s Ring’s revenue contribution within Amazon?” | Ring Inc. + Amazon | Collapses Ring into Amazon, losing the entity the query is about |
| Time Understanding | “How does Costco’s most recent quarter compare to the previous one?” | Last two reported quarters | Resolves to the current unreported quarter and the latest reported quarter |

This makes training a natural path: open models often have the necessary world knowledge, but lack the financial judgment to apply it correctly. Post-training teaches those behaviors directly, specializing the model to resolve entities, interpret reporting periods, and abstain when constraints do not apply. The result is frontier-like query understanding with the cost, latency, and deployment advantages of a smaller model.

Training Recipe

We train a model to perform two coupled tasks:

Extraction: identifying the companies and time windows that should constrain retrieval.
Abstention: recognizing when no company or time constraint should be emitted.

To do this, we adopt a two-stage post-training recipe. Supervised fine-tuning (SFT) first teaches the model the output schema and the basic extraction behavior; reinforcement learning (RL) then tunes the extraction–abstention tradeoff.

We evaluate these behaviors with separate metrics:

Company extraction: entity-level F1 score for extracted surface forms (e.g., “Apple”) and canonical representations of the company (e.g., “Apple Inc.”).
Time extraction: Intersection-over-Union (IoU) score over the predicted and labeled time ranges.
Abstention: accuracy on whether the model correctly decides to emit structured constraints or leave the query unconstrained.
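The three metrics above can be sketched as follows. This is a minimal illustration, not the evaluation code used in this work: the function names are hypothetical, companies are compared as plain string sets, and time ranges are assumed to be inclusive date pairs measured in days.

```python
from datetime import date


def entity_f1(predicted: set[str], labeled: set[str]) -> float:
    """Entity-level F1 over two sets of company strings."""
    if not predicted and not labeled:
        return 1.0  # assumption: two empty sets count as a perfect match
    true_pos = len(predicted & labeled)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(labeled)
    return 2 * precision * recall / (precision + recall)


def time_iou(pred: tuple[date, date], gold: tuple[date, date]) -> float:
    """Intersection-over-Union of two inclusive date ranges, in days."""
    def days(start: date, end: date) -> int:
        return max(0, (end - start).days + 1)

    inter = days(max(pred[0], gold[0]), min(pred[1], gold[1]))
    union = days(*pred) + days(*gold) - inter
    return inter / union if union else 0.0


def abstention_correct(pred_abstains: bool, gold_abstains: bool) -> bool:
    """True when the extract-vs-abstain decision matches the label."""
    return pred_abstains == gold_abstains
```

For example, predicting only “Apple” when the label is {“Apple”, “Amazon”} gives precision 1.0 and recall 0.5, so F1 is about 0.67; two non-overlapping quarters give a time IoU of 0.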

Stage 1: Supervised Fine-Tuning

We start with Qwen 3 as the base model and train it on labeled examples of financial queries and structured outputs. This stage teaches the model the basic task: follow the output schema, extract entities consistently, and represent time constraints in the format downstream systems can consume.

SFT excels at imitation: it teaches the model to follow the desired structured output, map products to companies, and decode phrases like “latest quarter”. However, SFT fails at calibration. The model becomes good at formatting these outputs, but far too eager to emit them. It learns how to extract, but not when to abstain.

Stage 2: Reinforcement Learning

We then run RL to optimize the operating point directly. For each query, the model samples multiple candidate outputs. Each candidate is scored against the labeled answer using the same metrics we care about in evaluation: abstention correctness, company extraction quality, and time-window quality.

We adopt a Group Relative Policy Optimization (GRPO) style training regime: candidates are compared within the same query group, and the model is updated toward outputs that score better relative to the others. To ensure GRPO has meaningful signals to learn from, we decompose the reward signals into graded partial credit and an exact match bonus.
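The within-group comparison can be sketched with the commonly used GRPO advantage, which standardizes each candidate's reward against the other candidates sampled for the same query. The exact normalization used in this work is not stated, so treat the function below (and its `eps` parameter) as an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one rollout group (all from the same query)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidates sampled for a single query.
rollout_rewards = [0.2, 0.6, 1.0, 0.6]
advantages = group_relative_advantages(rollout_rewards)

# A group in which every candidate ties yields all-zero advantages.
tied_advantages = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

Candidates that score above the group mean get a positive advantage and are reinforced; those below get a negative one. When all candidates tie, every advantage is zero and the group contributes no gradient, which is why graded rewards matter.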

We use a composite reward score:

$$R = R_{\text{abstain}} + R_{\text{company}} + R_{\text{time}}$$

$R_{\text{abstain}}$ measures abstention accuracy: the model gets full credit if it correctly decides whether to extract or abstain. (This contributes up to 0.2 per positive query and up to 1.0 per negative query, i.e., a correctly abstained negative scores the full sample.)
$R_{\text{company}}$ measures the company extraction score: graded credit based on a blend of surface-form F1 and canonical-representation Jaccard similarity, plus a small bonus for an exact match. (Capped at 0.4 for positive queries; does not contribute for negative queries.)
$R_{\text{time}}$ measures the time-window score: graded credit based on temporal Intersection-over-Union (IoU), plus a small bonus for matching the exact time window. (Capped at 0.4 for positive queries; does not contribute for negative queries.)
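A minimal sketch of how these components could combine, using the caps stated above (0.2 / 0.4 / 0.4 for positive queries, 1.0 for a correctly abstained negative). The blend weight and the size of the exact-match bonus are not given in this post, so `blend` and `exact_bonus` are assumed values. The sketch also assumes every positive query is graded on both company and time.

```python
def composite_reward(
    gold_abstains: bool,
    pred_abstains: bool,
    surface_f1: float = 0.0,
    canonical_jaccard: float = 0.0,
    company_exact: bool = False,
    time_iou: float = 0.0,
    time_exact: bool = False,
    blend: float = 0.5,         # assumed F1 / Jaccard mixing weight
    exact_bonus: float = 0.05,  # assumed size of the exact-match bonus
) -> float:
    if gold_abstains:
        # Negative query: a correct abstention earns the full sample.
        return 1.0 if pred_abstains else 0.0
    if pred_abstains:
        # Positive query wrongly abstained: no decision credit, nothing to grade.
        return 0.0

    r_abstain = 0.2  # correct decision to extract
    graded_company = blend * surface_f1 + (1 - blend) * canonical_jaccard
    r_company = min(0.4, (0.4 - exact_bonus) * graded_company
                    + exact_bonus * float(company_exact))
    r_time = min(0.4, (0.4 - exact_bonus) * time_iou
                 + exact_bonus * float(time_exact))
    return r_abstain + r_company + r_time
```

Under these assumptions, a positive query with half-credit company and time scores and no exact matches earns 0.2 + 0.175 + 0.175 = 0.55, while a fully correct extraction earns 1.0.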

RL gives the model a way to learn both extraction and abstention at once: be precise when constraints exist, abstain when they do not, and operate at the desired tradeoff.

Making GRPO Efficient

GRPO only learns when candidates within a rollout group differ in score. If all candidates are identical, or if they all receive the same score, the group provides little useful signal. A few choices mattered:

Focus RL on examples with variance. Easy examples were mostly solved by the SFT checkpoint, so their rollout groups tended to collapse to the same answer and produce little learning signal. We focused RL on medium, hard, and abstention-heavy cases where the model still made competing mistakes.
Keep rollouts diverse. After SFT, the model was confident enough that sampled outputs often looked nearly identical. We ran RL ablations across temperature and dropout settings. At low temperature, rollouts collapse to near-identical outputs: a high fraction of rollout groups have zero reward variance and produce no learning signal. At high temperature without dropout, training destabilizes. Enabling dropout together with a higher sampling temperature kept candidates diverse and the within-group comparisons useful (as demonstrated in the figure below).
Use graded scores instead of binary scores. Binary right/wrong scoring creates many ties: all candidates can be equally right, equally wrong, or indistinguishable on a coarse metric. Graded company and time scores give partial credit, which creates a smoother ranking within each rollout group.
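The "zero reward variance" diagnostic mentioned above is straightforward to track. The sketch below is an illustration with hypothetical function names and made-up reward values, not the monitoring code used in this work.

```python
from statistics import pvariance


def has_learning_signal(group_rewards: list[float], tol: float = 1e-8) -> bool:
    """A rollout group is useful only if its candidates differ in reward."""
    return pvariance(group_rewards) > tol


def zero_variance_fraction(groups: list[list[float]]) -> float:
    """Fraction of rollout groups whose candidates all received the same reward."""
    if not groups:
        return 0.0
    dead = sum(not has_learning_signal(g) for g in groups)
    return dead / len(groups)


# Hypothetical rewards for three queries, four rollouts each.
groups = [
    [1.0, 1.0, 1.0, 1.0],    # easy example: every rollout ties
    [0.2, 0.55, 1.0, 0.55],  # graded scores separate the candidates
    [0.0, 0.0, 0.0, 0.0],    # all equally wrong
]
dead_fraction = zero_variance_fraction(groups)
trainable = [g for g in groups if has_learning_signal(g)]
```

Here two of the three groups carry no signal. A rising dead fraction during training suggests that sampling has collapsed or that the remaining examples are too easy.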

Choosing the Right Model Size

To determine the optimal model configuration for this task, we train Qwen 3 at four different parameter counts (1.7B, 4B, 8B, and 14B) and analyze the reward on the validation dataset after one RL epoch.

*Reward and exact match (EM) score for different sizes of Qwen 3*

| Model size | 1.7B | 4B | 8B | 14B |
| --- | --- | --- | --- | --- |
| R (reward) | 0.7935 | 0.8488 | 0.8549 | 0.8617 |
| EM | 0.4213 | 0.5389 | 0.5731 | 0.5787 |

We selected the 8B model because it offered the best balance between quality and serving efficiency. The 14B model improved reward only slightly (0.8617 vs. 0.8549) while consuming nearly twice the memory and reducing batching capacity under load; an additional epoch of RL on the 8B model improved reward further without those serving costs.

Results

To evaluate our specialist model, we tested it against off-the-shelf models on a held-out test set of ~500 queries. Our model outperforms Claude Sonnet 4.6 and the Gemini Flash family, achieving the best reward and exact match (EM) scores.

SFT alone gives the model strong extraction capabilities. Adding RL on top of the SFT checkpoint further improves its extraction-vs-abstention tradeoff.

Figure: Model Performance. Left: performance across different training stages. Right: performance compared to closed frontier models (Claude Sonnet 4.6 and Gemini Flash family). R = task reward, EM = exact match. The breakdown shows the reward components: abstention, company, and time score.

Among the four closed models, Gemini 3.5 Flash has the best all-round performance. Gemini 3 Flash scores highest on company and time extraction but lowest on abstention, which illustrates the extraction-abstention tradeoff.

Our model serves at a median latency of ~660 ms on a single H100 GPU, roughly 3.1x faster than Claude Sonnet 4.6 and ~1.25-2.2x faster than the Gemini Flash model family. The specialized model therefore wins on both task score and latency, Pareto-dominating the off-the-shelf API models.

Downstream Improvement

To quantify the downstream impact, we assess the final response of our agentic system when using the new query-understanding model. We compare this against our pre-existing production system (which uses a legacy query-understanding model). Answer quality is measured using Criteria Eval, our internal rubric-based evaluation suite, across hundreds of queries and thousands of grading criteria. The new model yielded a true Pareto improvement across quality, efficiency, and speed:

Higher quality: The average rubric satisfaction rate improved by 1.5pp.
Fewer tool calls: The average number of tool calls dropped by 4% due to fewer false starts.
Lower latency: Driven by the reduction in tool calls, median time-to-first-token (TTFT) dropped by 23%.

Cost

Frontier APIs are cheap per request, but costs compound at scale for a high-volume component that sits on the critical path of every query. For this task, Gemini Flash costs ~$2.45 per thousand requests. By contrast, a single H100 GPU rented at $3.50 per hour costs a flat $84 per day, however many requests it serves within its capacity.

| Daily request volume | API cost ($/day) | Self-hosted cost ($/day) | Winner |
| --- | --- | --- | --- |
| 10,000 | $24.50 | $84.00 (1 GPU) | API wins |
| 50,000 | $122.50 | $84.00 (1 GPU) | Comparable |
| 500,000 | $1,225.00 | $168.00 (2 GPUs) | Self-host is ~7x cheaper |
| 1,000,000 | $2,450.00 | $336.00 (4 GPUs) | Self-host is ~7x cheaper |

Once daily traffic crosses the break-even point (~34K requests per day), the self-hosted specialist model becomes the cheaper option, and the gap widens with volume. Beyond unit economics, owning the model and deployment eliminates vendor lock-in and protects production from API rate limits during traffic spikes.
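The arithmetic behind the table and the break-even point can be reproduced in a few lines. The prices come from the text above; the per-GPU capacity of roughly 250K requests per day is an assumption inferred from the GPU counts in the table, not a measured figure.

```python
import math

API_COST_PER_REQUEST = 2.45 / 1000  # ~$2.45 per thousand requests
GPU_COST_PER_DAY = 3.50 * 24        # one H100 at $3.50/hour -> $84/day
# Assumption: ~250K requests/day per GPU, inferred from the table's GPU counts.
REQUESTS_PER_GPU_PER_DAY = 250_000


def api_cost(daily_requests: int) -> float:
    return daily_requests * API_COST_PER_REQUEST


def self_hosted_cost(daily_requests: int) -> float:
    gpus = max(1, math.ceil(daily_requests / REQUESTS_PER_GPU_PER_DAY))
    return gpus * GPU_COST_PER_DAY


# Daily volume at which one GPU costs the same as the API.
break_even = GPU_COST_PER_DAY / API_COST_PER_REQUEST

for volume in (10_000, 50_000, 500_000, 1_000_000):
    print(f"{volume:>9,}  API ${api_cost(volume):>8,.2f}  "
          f"self-hosted ${self_hosted_cost(volume):>6,.2f}")
```

The break-even works out to about 34,286 requests per day, which matches the ~34K figure. Self-hosted cost grows in steps of one GPU, so the ratio to API cost depends on how fully each GPU is used.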

Conclusion

Building a high-quality query-understanding layer for financial agents requires navigating a critical tradeoff: the system must aggressively capture vague or implicit financial constraints, yet have the discipline to abstain entirely when a query calls for broader conceptual search. Three main takeaways emerged from this work:

SFT teaches the model what to extract and how to format the response. RL, on the other hand, helps the model learn when to extract by directly optimizing the tension between extraction and abstention.
The model’s parametric knowledge is the limiting factor. Very small models may lack the baseline knowledge and instruction-following capacity required to perform the task reliably. Conversely, scaling beyond a certain size yields diminishing returns: in our runs, going from 8B to 14B added less than 0.01 reward.
Specialist models can outperform general-purpose off-the-shelf APIs, delivering frontier-level performance with a Pareto improvement across quality, latency, and cost.

Acknowledgments: We would like to thank the members of the ML team (Vishank Bhatia, Richard Diehl Martinez, Ashwin Paranjape, Yuhao Zhang) for their discussions and reviews of this work, the infrastructure team (Kyle Chang) for providing the robust foundation that made this work possible, and Paddu Raghavan for the assistance with design.