Promptriever: Elevating Retrieval Models with Instruction-Following Capabilities
Yuhao Zhang
At Samaya AI, our customers pose complex financial queries and expect to find accurate, comprehensive information quickly. To support these workflows, we need retrieval systems that precisely understand the subtle details in user queries. In collaboration with researchers from Johns Hopkins University, we created Promptriever, the first retrieval model built to handle complex queries with detailed instructions, and published our research at ICLR 2025.
In this blog post, we share insights into how Promptriever was created, highlight key findings from our experiments, and provide links to additional resources that help you explore further.
Why do we need retrieval models that follow instructions?
In a nutshell, modern retrieval models work by encoding both the user query and candidate passages into vector representations, and subsequently scoring the candidate passages based on their similarity to the query vector. Today’s retrieval models are primarily trained with short, simple web search queries, and as a result, they rely on keyword matching or shallow semantic associations to find the relevant passages.
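To make this concrete, here is a minimal sketch of the bi-encoder scoring idea, using an off-the-shelf sentence-transformers encoder as a stand-in (not Promptriever itself) and an illustrative financial query:

```python
# Minimal bi-encoder retrieval sketch: embed the query and the candidate
# passages into the same vector space, then rank passages by similarity.
# The encoder below is a generic off-the-shelf model, used only to
# illustrate the mechanism.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "What was the impact of rising interest rates on regional banks?"
passages = [
    "Regional bank margins compressed as deposit costs rose faster than loan yields.",
    "The company launched a new consumer app with improved onboarding.",
    "Higher rates increased unrealized losses on banks' held-to-maturity portfolios.",
]

# Encode both sides into vectors.
query_vec = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
passage_vecs = encoder.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Score every passage against the query and rank by cosine similarity.
scores = util.cos_sim(query_vec, passage_vecs)[0]
for score, passage in sorted(zip(scores.tolist(), passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```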

However, real-world financial queries are intricate. As illustrated in Figure 1, these queries are far more complex than typical web searches, containing detailed instructions that cover time ranges, industry constraints, definitions of relevance, and more. As a result, today's off-the-shelf retrieval models return many irrelevant results, forcing users to repeatedly refine their queries, a process that quickly becomes frustrating and inefficient.
To solve this problem, we built Promptriever to give retrieval models strong instruction-following capabilities, so they can better understand the subtleties and nuances in user queries.
How does Promptriever work?
Promptriever is built on a bi-encoder architecture in which the same model encodes both the query and the passages. For advanced understanding capabilities, Promptriever uses a large language model (LLM), such as the open-weights LLaMA model, as its backbone. We trained Promptriever on a retrieval dataset augmented with long instructions, allowing it to adapt its definition of relevance based on detailed natural language prompts.
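The snippet below sketches how an LLM backbone can serve as a dense encoder in the RepLLaMA style, pooling the hidden state of the final token into an embedding for both the instruction-augmented query and the passage. The checkpoint name, the "query:"/"passage:" prefixes, and the pooling details are illustrative assumptions; see the released code for the exact configuration.

```python
# Hedged sketch of LLM-based dense encoding: take the hidden state of the
# final (EOS) token as the text embedding. The backbone checkpoint and
# prompt prefixes below are placeholders, not the exact Promptriever setup.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder (gated) backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def embed(text: str) -> torch.Tensor:
    # Append the EOS token so the final hidden state summarizes the whole input.
    ids = tokenizer(text, return_tensors="pt").input_ids
    ids = torch.cat([ids, torch.tensor([[tokenizer.eos_token_id]])], dim=1)
    with torch.no_grad():
        hidden = model(input_ids=ids).last_hidden_state  # (1, seq_len, dim)
    return torch.nn.functional.normalize(hidden[0, -1], dim=-1)  # last-token pooling

# The same encoder handles the instruction-augmented query and the passage;
# relevance is simply their dot product.
query = ("query: What drove revenue growth in 2023? "
         "Only consider cloud-segment results; exclude hardware sales.")
passage = "passage: Cloud revenue grew 28% in 2023, driven by enterprise AI workloads."
print(float(embed(query) @ embed(passage)))
```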

To achieve this, for the publicly released version of Promptriever, we created a synthetic data generation process that starts with the MS MARCO dataset [1], which contains short queries together with positive and negative passages mined from web searches.
Figure 2 presents a high-level sketch of our method. For a given short query and its positive passage from MS MARCO, we prompt an LLM to:
- Generate a detailed instruction to augment the original query. To ensure generalization, we generate instructions with varying lengths and styles.
- Select an “instruction-positive” passage, i.e., one that is a positive match for the instruction-augmented query.
- Generate “instruction-negative” passages: passages that are a positive match for the original query but a negative match (i.e., irrelevant) for the instruction-augmented query.
The quality of the instruction-positive and instruction-negative passages was crucial for successful training, so we filtered them with additional criteria to ensure their accuracy. This resulted in a final dataset of ~491,000 queries.
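The snippet below is a hedged sketch of this augmentation step for a single MS MARCO example. The generator model, prompt wording, and output schema are illustrative assumptions rather than the exact prompts used to build the released dataset:

```python
# Hedged sketch of instruction augmentation: given an MS MARCO query and its
# positive passage, ask a generator LLM for (a) a detailed instruction and
# (b) candidate instruction-negative passages. Prompt text, model name, and
# JSON schema here are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

def augment_example(query: str, positive_passage: str) -> dict:
    prompt = f"""You are creating training data for an instruction-following retriever.

Original query: {query}
Positive passage: {positive_passage}

1. Write a detailed instruction (1-3 sentences, varied length and style) that
   narrows what counts as relevant for this query.
2. Say whether the positive passage still satisfies the instruction-augmented query.
3. Write two "instruction-negative" passages: relevant to the original query,
   but clearly irrelevant once the instruction is applied.

Return JSON with keys: instruction, still_positive (bool), instruction_negatives (list).
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in generator; any capable instruct LLM works
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

example = augment_example(
    "what is a lobbyist",
    "A lobbyist is a person who tries to influence legislation on behalf of a group.",
)
# Examples whose positive passage no longer satisfies the augmented query
# would be filtered out before training.
if example["still_positive"]:
    print(example["instruction"])
    print(example["instruction_negatives"])
```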
With this dataset, we then trained a model initialized from pretrained LLaMA weights using Low-Rank Adaptation (LoRA), following the RepLLaMA recipe [2].
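As a rough illustration of this training setup, the sketch below attaches LoRA adapters to a LLaMA backbone and defines a standard contrastive loss over instruction-positive and instruction-negative passages. The rank, alpha, target modules, and temperature are illustrative defaults, not the exact hyperparameters reported in the paper:

```python
# Hedged sketch of LoRA fine-tuning for contrastive retrieval training
# (RepLLaMA-style). Hyperparameters below are illustrative defaults.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

backbone = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are updated

def contrastive_loss(q_emb, pos_emb, neg_embs, temperature=0.05):
    # q_emb: (dim,), pos_emb: (dim,), neg_embs: (num_neg, dim), all L2-normalized.
    # The positive passage sits at index 0; negatives (instruction-negatives
    # and/or in-batch negatives) follow.
    logits = torch.cat([pos_emb.unsqueeze(0), neg_embs]) @ q_emb / temperature
    target = torch.zeros(1, dtype=torch.long)
    return torch.nn.functional.cross_entropy(logits.unsqueeze(0), target)
```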
Results
We evaluate Promptriever extensively on datasets covering instruction following as well as in-domain and out-of-domain text retrieval. We use RepLLaMA as the primary baseline, since it shares the same backbone, and also compare against other state-of-the-art retrieval models.
Our experiments demonstrate that:
• Promptriever achieves state-of-the-art instruction-following capabilities. As shown in Figure 3, it outperforms other strong models by large margins on both the FollowIR and InstructIR datasets [3]. Compared to RepLLaMA, it is also more robust to variations in query phrasing, improving robustness scores by nearly 13 points on InstructIR.
• Promptriever demonstrates stronger out-of-domain (OOD) retrieval performance. Figure 4 illustrates how Promptriever compares against RepLLaMA and BM25 on the popular BEIR benchmark covering a diverse set of text domains. Note that the strong OOD performance is purely due to instruction training, as we never trained explicitly with data from other domains.
• Promptriever maintains competitive in-domain retrieval performance. On the TREC DL20 retrieval evaluation, Promptriever scores 72.3 nDCG@10 compared to 71.8 from RepLLaMA.
For more detailed ablation studies and analysis, please refer to the Promptriever paper.


Impact
Published at ICLR 2025, Promptriever is the first research work that elevates retrieval models' performance on queries with complex instructions. It opens up new possibilities for building retrieval systems that understand complicated user requirements without requiring a complicated architecture.
To facilitate research in this direction, we have made the following resources available together with our paper:
- Open-source code for training and evaluating Promptriever, as well as public model weights: https://github.com/orionw/promptriever
- Synthetic dataset used to train Promptriever: https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions
- HuggingFace-hosted demo built with the SciFact dataset: https://huggingface.co/spaces/orionweller/retrieval-prompting
Contributors
Orion Weller led the work on Promptriever during his internship at Samaya AI, under the mentorship of Jack Hessel, Yuhao Zhang and Ashwin Paranjape. Other collaborators include Benjamin Van Durme and Dawn Lawrie from Johns Hopkins University.
Special thanks to members of the Samaya ML team, including Fabio Petroni, Michele Bevilaqua, Christos Baziotis and Roberto Dessi for their feedback on the research work, and Ashwin for his feedback on this blog post.
Footnotes
[1] MS MARCO is a large-scale dataset widely used for training and evaluating text retrieval and question answering systems.
[2] Our retrieval model is built on the idea and training recipe of RepLLaMA, a retrieval model fine-tuned on top of the LLaMA-2 pretrained weights. We use RepLLaMA as the primary baseline for comparison throughout this work.
[3] FollowIR and InstructIR are recently published benchmarks for evaluating retrieval models' ability to understand queries with complex instructions and context.