Can Machines Assist Humans in Verifying World Knowledge?

Fabio Petroni and Michele Bevilacqua

October 19, 2023

For millennia humans have been producing knowledge. The advent of computers helped turn this knowledge into a digital format, and the web made it widely accessible, boosting progress and further knowledge creation. Search engines have dramatically increased the breadth and accessibility of information that we can use in our everyday decision making, much beyond the limits of books and conventional libraries.

Yet having access to a vast body of knowledge doesn't make us omniscient: making sense of that knowledge remains extremely difficult and time-consuming. A researcher working on a new hypothesis still needs to read through the literature, write notes, and mentally connect the dots. A knowledge worker needs years of experience, research, and reading to spot a novel correlation. The process is further complicated by the fact that different pieces of information can conflict with each other and must be understood and traced back to their original sources.

Our recent study published in Nature Machine Intelligence, “Improving Wikipedia Verifiability with AI”, suggests that building machines to assist humans in navigating, interpreting, and verifying world knowledge is achievable today. The paper's authors include our CTO Fabio Petroni, who conducted much of this work while at Meta AI's FAIR labs, and our founding ML engineer Michele Bevilacqua.

Wikipedia is one of the most used collections of knowledge on the planet, an incredible collective effort to catalog what we know about the world, powered by volunteer and expert editors. A pillar of Wikipedia is verifiability: claims need to be backed by citations, so that “people using the encyclopedia can check that the information comes from a reliable source”. Yet the task of ensuring the verifiability of Wikipedia is titanic. Expert editors currently rely on the work of volunteers (you can help with tools such as citationhunt) who: (1) identify and tag claims likely to fail verification by checking the Wikipedia claim against the cited source; and (2) suggest a reliable source that corroborates the claim to replace a citation that failed verification. At the time of writing, over 500,000 statements on Wikipedia are marked as “Citation needed”.

In the paper, we show that a machine can assist editors with both tasks, discovering problematic citations and improving their verifiability. The system we propose, SIDE, leverages a specialized language model to evaluate Wikipedia claim-citation pairs, assigning verifiability scores and identifying potential evidence shortfalls. It is equipped with a retrieval engine able to extract in-context Wikipedia claims and search the web-scale 'Sphere' corpus for alternative sources. SIDE flags claims with low verifiability and suggests alternative sources, potentially guiding editors in their content curation process.

Figure 1: The decision flow of SIDE from a claim on Wikipedia to a suggestion for a new citation.
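
The decision flow can be approximated in a few lines of Python. This is only a toy sketch, not the SIDE implementation: `overlap_score` is a hypothetical word-overlap stand-in for the specialized language model, the `candidates` list stands in for passages retrieved from the Sphere corpus, and the 0.5 threshold is an arbitrary assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


def tokens(text: str) -> set:
    """Lowercased word set; a crude stand-in for a learned representation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(claim: str, source_text: str) -> float:
    """Hypothetical verifiability score: share of claim words found in the source."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(source_text)) / len(claim_tokens)


@dataclass
class Suggestion:
    flagged: bool                           # True if the current citation looks weak
    current_score: float                    # score of the existing citation
    suggested_source: Optional[str] = None  # better candidate, if one was found


def review_citation(claim: str, cited_text: str, candidates: List[str],
                    threshold: float = 0.5) -> Suggestion:
    """Score the existing citation; if it is weak, propose the best candidate."""
    current = overlap_score(claim, cited_text)
    if current >= threshold:
        return Suggestion(flagged=False, current_score=current)
    # Rank the candidate sources and keep the one that best supports the claim.
    best = max(candidates, key=lambda c: overlap_score(claim, c), default=None)
    if best is not None and overlap_score(claim, best) > current:
        return Suggestion(True, current, best)
    return Suggestion(True, current)
```

In this sketch a suggestion is only surfaced when a candidate scores higher than the existing citation; the final decision is left to a human editor, as in the workflow described above.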

In extensive evaluations, SIDE demonstrated a strong ability to pinpoint claims with failed verifications on Wikipedia. Beyond identifying these discrepancies, for claims deemed likely to be unverifiable, the system's suggested citations were preferred by Wikipedia users more often than the original Wikipedia citations.

Figure 2: Wikipedia users' annotations via the SIDE demo.

At Samaya AI, we're developing advanced machines that redefine how experts manage, interact with, and ultimately craft knowledge. We're exploring exciting new directions and need the brightest minds to help navigate these uncharted territories. If you're fueled by curiosity and thrive on challenge, we're eager for you to join our quest. Reach out and connect with us!

Read the full paper:

Get the code:

See the demo:

Information Overload: The Challenges of Expanding Context Windows in Large Language Models

Nelson Liu, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Maithra Raghu

August 3, 2023

Language models have demonstrated impressive performance for a variety of applications and use-cases, but limitations remain—for example, it is difficult to add knowledge beyond their pre-training knowledge cut-off, and they may generate factually incorrect statements. Overcoming these shortcomings requires incorporating external knowledge into language models.

The predominant approach for incorporating external knowledge into language models is retrieval augmentation, where a retrieval system (e.g., a commercial search engine like Google) fetches relevant and up-to-date knowledge for the language model to use. This enables language models to use knowledge beyond what’s seen during training.
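
As a minimal sketch of the idea, the snippet below ranks passages by word overlap with the query and prepends the top results to the prompt. The functions `retrieve` and `augment_prompt` and the overlap ranking are illustrative assumptions; a real system would call a search engine or a learned retriever instead.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercased word set used for the toy overlap ranking."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k passages sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(query_tokens & tokenize(p)), reverse=True)
    return ranked[:k]


def augment_prompt(query: str, corpus: List[str], k: int = 2) -> str:
    """Prepend the retrieved passages to the question before calling a model."""
    passages = retrieve(query, corpus, k)
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{context}\n\nQuestion: {query}\nAnswer:"
```

The string returned by `augment_prompt` would then be passed to the language model, which can draw on the retrieved passages rather than only on what it saw during training.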

A different approach to augmenting language models with knowledge beyond their training is to expand their context windows, that is, to increase the number of words (tokens) that the language model can use as input. The state-of-the-art language models of 3 years ago had context windows of 2048 tokens (e.g., GPT-3; text-davinci-001). In the last few months, however, context windows have increased by an order of magnitude: MosaicML released MPT-30B, an open-source model that supports 8K tokens, OpenAI released its extended-context GPT-3.5-Turbo 16K, and Anthropic’s Claude supports input contexts of up to 100K tokens.

In principle, longer context windows provide an appealing and simple alternative to retrieval-based augmentation: instead of using and maintaining a retrieval system, you might be able to fit all relevant knowledge into the language model’s input context. But even when all of this knowledge is in the input context, can the language model actually use it when prompted to solve downstream tasks?

We address this question in our recent study, Lost in the Middle: How Language Models Use Long Contexts. This study evaluates whether long-context language models can robustly use knowledge in lengthy input contexts. We find that language model performance is highest when relevant knowledge occurs at the very start or end of long input contexts, and significantly degrades when models are forced to use knowledge in the middle of the context. These results indicate that current language models do not effectively use their entire context, and that retrieval is still a crucial ingredient for effectively augmenting language models with external knowledge.

Measuring how language models use their input context

To measure how language models use their input context, we evaluate their performance on multi-document question answering: given a user question and several relevant documents (exactly one of which contains the answer), the model is tasked with answering the user question. The figure below provides an example.

Input Context:
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1] (Title: Asian Americans in science and technology) Prize in physics for discovery of the subatomic particle J/ψ. Subrahmanyan Chandrasekhar shared...
Document [2] (Title: List of Nobel laureates in Physics) The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Röntgen, of Germany, who received...
Document [3] (Title: Scientist) and pursued through a unique method, was essentially in place. Ramón y Cajal won the Nobel Prize in 1906 for his remarkable...

Question: who got the first nobel prize in physics

Desired Answer:
Wilhelm Conrad Röntgen

Figure 1: Example of the multi-document question answering task, with an input context and the desired model answer. The relevant document for correctly answering the request is bolded within the input context (number 2).
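
A prompt in this layout can be assembled with a small helper. The sketch below is an assumption about how such a prompt might be built, not the code used in the study; `build_prompt` and its `(title, text)` input format are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

INSTRUCTION = (
    "Write a high-quality answer for the given question using only the "
    "provided search results (some of which might be irrelevant)."
)


def build_prompt(question: str, documents: List[Tuple[str, str]]) -> str:
    """Format (title, text) pairs into the numbered layout of the example above."""
    lines = [INSTRUCTION, ""]
    for index, (title, text) in enumerate(documents, start=1):
        lines.append(f"Document [{index}] (Title: {title}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Because documents are numbered by their order in the list, reordering the list is enough to change where the answer-bearing document appears in the context.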

In particular, we study how model performance is affected by the position of relevant knowledge in the input context by changing the position of the document that contains the answer. For example, if we give the model a query and 20 documents, we can place the document that contains the answer at the very start of the context (the 1st position), in the middle (the 10th position), or at the end (the 20th position). If models can properly use all of their context, changing the location of the document that contains the answer should not affect performance: the model should achieve high accuracy regardless of whether the relevant document occurs first, in the middle, or last.

Input Context:
Write a high-quality answer for the given question using only the provided search results (some of which might be irrelevant).

Document [1](Title: List of Nobel laureates in Physics) The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Röntgen, of Germany, who received...
Document [2](Title: Asian Americans in science and technology) Prize in physics for discovery of the subatomic particle J/ψ. Subrahmanyan Chandrasekhar shared...
Document [3](Title: Scientist) and pursued through a unique method, was essentially in place. Ramón y Cajal won the Nobel Prize in 1906 for his remarkable...

Question: who got the first nobel prize in physics

Desired Answer:
Wilhelm Conrad Röntgen

Figure 2: Modulating the position of relevant knowledge within the input context for the multi-document question answering example presented in Figure 1. Reordering the documents in the input context does not affect the desired output. The relevant document for correctly answering the request is bolded within the input context (number 1).
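
The reordering itself is a simple list operation. The following sketch, with the hypothetical helper `place_answer_document`, inserts the answer-bearing document at a chosen 1-based position while leaving the other documents in their original order.

```python
from typing import List


def place_answer_document(distractors: List[str], answer_doc: str, position: int) -> List[str]:
    """Return a new document list with answer_doc at the given 1-based position."""
    if not 1 <= position <= len(distractors) + 1:
        raise ValueError("position must be between 1 and len(distractors) + 1")
    ordered = list(distractors)  # copy so the caller's list is not modified
    ordered.insert(position - 1, answer_doc)
    return ordered
```

With 19 other documents, calling the helper with positions 1, 10, and 20 produces the start, middle, and end settings described above.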

Our results (illustrated below; Figure 3) show that model performance is substantially affected by the position of relevant knowledge (the document that contains the answer) in the input context. GPT-3.5-Turbo has the largest accuracy degradation; its performance drops by more than 20% when the document containing the answer is moved from the front of the context (1st position) to the middle (10th position). These results indicate that current language models fail to robustly use long input contexts when prompted for downstream tasks.

Figure 3: The effect of changing the position of relevant knowledge (document containing the answer) on multi-document question answering performance. Lower positions are closer to the start of the input context. Performance is generally highest when relevant knowledge is positioned at the very start or very end of the context, and rapidly degrades when models must reason over knowledge in the middle of their input context.
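
A curve like this can be computed by grouping model outputs by the position of the answer-bearing document. The sketch below assumes a simple scoring rule (a prediction counts as correct if it contains the gold answer, ignoring case); that rule and the record format are assumptions for illustration and are not stated in this post.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def is_correct(prediction: str, gold_answer: str) -> bool:
    """Assumed scoring rule: the gold answer appears in the model output."""
    return gold_answer.lower() in prediction.lower()


def accuracy_by_position(records: Iterable[Tuple[int, str, str]]) -> Dict[int, float]:
    """records are (answer_position, model_prediction, gold_answer) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for position, prediction, gold in records:
        totals[position] += 1
        hits[position] += is_correct(prediction, gold)
    return {pos: hits[pos] / totals[pos] for pos in sorted(totals)}


# Toy records made up purely to show the call; they are not results from the study.
toy_records = [
    (1, "It was Paris.", "Paris"),
    (1, "paris", "Paris"),
    (10, "I do not know", "Paris"),
    (10, "Paris", "Paris"),
]
print(accuracy_by_position(toy_records))
```

Plotting the returned mapping with position on the x-axis and accuracy on the y-axis gives one line per model, which is the shape of comparison the figure reports.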

What's next?

These findings are at the core of our work at Samaya AI, where we are developing a cutting-edge platform for knowledge discovery and expert reasoning. We’re exploring several directions related to these results, including improving language models’ abilities to reason over complex external information and using them effectively for knowledge retrieval and discovery. If you are interested in pushing the boundaries of LLMs and knowledge, get in touch!

Read the full paper: Lost in the Middle: How Language Models Use Long Contexts

Get the code: