Before the Search: Why Query Understanding Matters for Financial Agents
Deep research agents are bottlenecked by information. The quality of an AI agent’s final answer depends not only on the agent’s reasoning ability, but also on whether its retrieval system can understand and translate a natural-language query into precise, high-signal context. This initial query translation process, known as query understanding, is a vital yet often overlooked component that impacts the accuracy of agentic pipelines.
Errors at this stage can quietly propagate through the entire research workflow, leading agents to reason over incomplete or irrelevant evidence. Because query understanding sits on the critical path of nearly every production agentic workflow, a good model must be semantically accurate, cheap, and fast.
In this post, we describe how we train specialist models to better understand financial queries and use that understanding to improve downstream agentic research. Our model outperforms Claude Sonnet 4.6 and Gemini 3.5 Flash on this task while delivering lower latency and lower inference cost.
Understanding financial queries is challenging because the information needed to answer them is often implicit. Users rarely specify exact company names, dates, or reporting periods, yet these details determine what evidence should be retrieved and analyzed.
For financial queries, two semantic attributes matter most: the companies being referenced and the time window implied by the user. When neither applies, the agent should abstain rather than invent constraints that do not belong.
The table below illustrates different query understanding scenarios, why they are hard, and the kinds of reasoning required to resolve them correctly.
Predicting the right company and time window is essential because these parameters determine what evidence gets retrieved. The model also has to know when to abstain, so generic queries can follow broader retrieval paths instead of being forced into company-specific or time-bounded search. This balance is difficult: over-extraction hallucinates constraints and misdirects retrieval, while over-abstention misses implicit companies, underspecified time windows, and event-based anchors. Either failure can quietly throw the agent off, leading it to reason over the wrong evidence while still producing a coherent-looking answer.
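To make the three outcomes concrete (companies, time window, abstention), here is a minimal sketch of what a structured output could look like. The class and field names are hypothetical and chosen for illustration; the production schema is not described in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeWindow:
    # ISO dates (YYYY-MM-DD); a hypothetical way to represent the implied period.
    start: str
    end: str


@dataclass(frozen=True)
class QueryUnderstanding:
    # Hypothetical output schema: field names are illustrative assumptions.
    companies: tuple[str, ...] = ()           # resolved tickers, empty if none apply
    time_window: Optional[TimeWindow] = None  # None if no period is implied

    @property
    def abstained(self) -> bool:
        # Abstention means neither attribute applies to the query.
        return not self.companies and self.time_window is None


# A query that names (or implies) a company and a reporting period.
specific = QueryUnderstanding(
    companies=("AAPL",),
    time_window=TimeWindow(start="2024-01-01", end="2024-03-31"),
)

# A generic query: no company and no period, so the output is an abstention.
generic = QueryUnderstanding()
```

Treating abstention as "both fields empty" keeps the downstream contract simple: retrieval applies company or date filters only when the corresponding field is populated.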
Large frontier models are a natural candidate for this task, since they usually have the knowledge and reasoning ability needed to accurately resolve these attributes. However, they are too expensive and slow for a component that sits on the critical path of every query.
The tempting alternative is to use a fast state-of-the-art model. These models have a surprising amount of financial domain knowledge, but often fail to apply it with the right financial semantics. In our tests, Gemini 3 Flash frequently fell short on this task.
This makes training a natural path: open models often have the necessary world knowledge, but lack the financial judgment to apply it correctly. Post-training teaches those behaviors directly, specializing the model to resolve entities, interpret reporting periods, and abstain when constraints do not apply. The result is frontier-like query understanding with the cost, latency, and deployment advantages of a smaller model.
We train a model to perform two coupled tasks:
To do this, we adopt a two-stage post-training recipe. Supervised fine-tuning (SFT) teaches the model the output schema and the basic extraction behavior; reinforcement learning (RL) then tunes the extraction-abstention tradeoff.
We evaluate these behaviors with separate metrics:

We start with Qwen 3 as the base model and train it on labeled examples of financial queries and structured outputs. This stage teaches the model the basic task: follow the output schema, extract entities consistently, and represent time constraints in the format downstream systems can consume.
SFT excels at imitation: teaching the model to follow the desired structured outputs, map products to companies, and decode phrases like “latest quarter”. However, SFT fails at calibration. The model becomes good at formatting these outputs, but far too eager to produce them. It learns how to extract, but not when to abstain.
We then run RL to optimize the operating point directly. For each query, the model samples multiple candidate outputs. Each candidate is scored against the labeled answer using the same metrics we care about in evaluation: abstention correctness, company extraction quality, and time-window quality.
We adopt a Group Relative Policy Optimization (GRPO) style training regime: candidates are compared within the same query group, and the model is updated toward outputs that score better relative to the others. To ensure GRPO has meaningful signals to learn from, we decompose the reward signals into graded partial credit and an exact match bonus.
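The within-group comparison can be sketched as follows. This assumes the commonly used GRPO normalization (reward minus group mean, divided by group standard deviation); the post only says "GRPO style", so the exact form used in training may differ. The function name and `eps` parameter are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each candidate's reward against its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate has the same score.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Candidates for one query with different scores: better ones get positive advantage.
mixed = group_relative_advantages([0.2, 0.9, 0.5, 0.9])

# Candidates that all score the same: every advantage is zero, so no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The `flat` case shows why reward granularity matters: a purely binary score makes identical-score groups far more likely, whereas graded partial credit tends to separate candidates.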
We use a composite reward score:
RL gives the model a way to learn both extraction and abstention at once: be precise when constraints exist, abstain when they do not, and operate at the desired tradeoff.
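As an illustration of "graded partial credit plus an exact-match bonus", here is a minimal sketch. It is not the reward used in training: the equal weighting, the F1-based company score, the binary time-window score, and the `em_bonus` value are all assumptions made for the example.

```python
def f1(pred: set[str], gold: set[str]) -> float:
    """Set-overlap F1; defined as 1.0 when both sets are empty."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def composite_reward(pred: dict, gold: dict, em_bonus: float = 0.5) -> float:
    """Average of three component scores, plus a bonus for a fully exact output."""
    pred_abstains = not pred["companies"] and pred["time_window"] is None
    gold_abstains = not gold["companies"] and gold["time_window"] is None
    abstention = 1.0 if pred_abstains == gold_abstains else 0.0
    company = f1(set(pred["companies"]), set(gold["companies"]))  # graded credit
    time_score = 1.0 if pred["time_window"] == gold["time_window"] else 0.0
    partial = (abstention + company + time_score) / 3  # assumed equal weights
    exact = set(pred["companies"]) == set(gold["companies"]) and time_score == 1.0
    return partial + (em_bonus if exact else 0.0)


gold = {"companies": ["AAPL", "MSFT"], "time_window": ("2024-01-01", "2024-03-31")}
partial_pred = {"companies": ["AAPL"], "time_window": ("2024-01-01", "2024-03-31")}

generic_gold = {"companies": [], "time_window": None}
hallucinated = {"companies": ["AAPL"], "time_window": None}

print(composite_reward(gold, gold))                  # full credit plus bonus
print(composite_reward(partial_pred, gold))          # partial credit, no bonus
print(composite_reward(hallucinated, generic_gold))  # penalized for not abstaining
```

In this sketch a candidate that finds one of two companies still earns partial credit, so it ranks above a candidate that finds none, while the bonus keeps a clear gap between "almost right" and "exactly right".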
GRPO only learns when candidates within a rollout group differ in score. If all candidates are identical, or if they all receive the same score, the group provides little useful signal. A few choices mattered:

To determine the optimal model configuration for this task, we train Qwen 3 at four different parameter counts (1.7B, 4B, 8B, and 14B) and analyze the reward on the validation dataset after one RL epoch.
We select the 8B model because it offers the best balance between quality and serving efficiency. While the 14B model improves the overall score slightly, it consumes nearly twice the memory and reduces batching capacity under load; an additional epoch of RL on the 8B model improves reward further without those serving costs.
To evaluate our specialist model, we tested it against off-the-shelf models on a held-out test set of ~500 queries. Our model outperforms Claude Sonnet 4.6 and the Gemini Flash family, achieving the best reward and exact match (EM) scores.
SFT alone gives the model strong extraction capabilities. Adding RL on top of SFT further improves the extraction-abstention tradeoff.
Among the four closed models, Gemini 3.5 Flash has the best all-round performance. Gemini 3 Flash scores best on company and time-window extraction, but its abstention score is the lowest, which illustrates the tradeoff between extraction and abstention.
Our model serves at a median latency of ~660ms on a single H100 GPU, roughly 3.1x faster than Claude Sonnet 4.6 and ~1.25-2.2x faster than the Gemini Flash model family. The specialist model therefore wins on both quality and latency, Pareto-dominating the off-the-shelf API models.
To quantify the downstream impact, we assess the final response of our agentic system when using the new query-understanding model. We compare this against our pre-existing production system (which uses a legacy query-understanding model). Answer quality is measured using Criteria Eval, our internal rubric-based evaluation suite, across hundreds of queries and thousands of grading criteria. The new model yielded a true Pareto improvement across quality, efficiency, and speed:
Frontier APIs are cheap per request, but costs compound at scale for a high-volume component sitting on the critical path of every query. For a typical query in this task, Gemini Flash costs ~$2.45 per thousand requests. By contrast, a single H100 GPU rented at $3.50 per hour costs the same regardless of request volume, so its per-request cost falls as traffic grows.
Once traffic crosses the break-even point (~34K requests per day, the volume at which API spend matches the roughly $84 daily cost of one H100), the self-hosted specialist model becomes the obvious choice. Beyond unit economics, owning the model and deployment eliminates vendor lock-in and protects production from API rate limits during traffic spikes.
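The break-even figure can be reproduced with simple arithmetic. The two prices come from the text; reading the break-even as a daily volume, with the GPU reserved around the clock, is an assumption that happens to match the ~34K figure. The sketch also assumes one GPU can serve that volume and ignores engineering and operational overhead.

```python
API_COST_PER_1K_REQUESTS = 2.45  # USD, Gemini Flash figure from the text
GPU_COST_PER_HOUR = 3.50         # USD, H100 rental figure from the text
HOURS_PER_DAY = 24               # assumption: GPU is reserved around the clock

gpu_cost_per_day = GPU_COST_PER_HOUR * HOURS_PER_DAY
api_cost_per_request = API_COST_PER_1K_REQUESTS / 1000

# Daily volume at which API spend equals the fixed daily GPU cost.
break_even_requests_per_day = gpu_cost_per_day / api_cost_per_request

print(f"GPU cost per day: ${gpu_cost_per_day:.2f}")
print(f"Break-even volume: {break_even_requests_per_day:,.0f} requests/day")
```

Above this volume each additional request is effectively free on the self-hosted model until the GPU saturates, while API spend keeps growing linearly.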
Building a high-quality query-understanding layer for financial agents requires navigating a critical tradeoff: the system must aggressively capture vague or implicit financial constraints, yet have the discipline to abstain entirely when a query requires broader conceptual search. Three main takeaways emerged from this work:
Acknowledgments: We would like to thank the members of the ML team (Vishank Bhatia, Richard Diehl Martinez, Ashwin Paranjape, Yuhao Zhang) for their discussions and reviews of this work, the infrastructure team (Kyle Chang) for providing the robust foundation that made this work possible, and Paddu Raghavan for the assistance with design.