Making Retrieval-Augmented Language Models Robust to Irrelevant Context
Ori Yoran, Tomer Wolfson, Ori Ram, Jonathan Berant

TL;DR
This paper analyzes how irrelevant retrieved information hurts retrieval-augmented language models and proposes methods to improve their robustness, so that relevant retrievals improve performance and irrelevant ones do not harm it.
Contribution
It introduces two techniques, NLI-based filtering and data augmentation with relevant and irrelevant contexts, to make RALMs more robust to irrelevant retrievals.
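The NLI-based filtering idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `toy_entailment_score` is a hypothetical stand-in for a real NLI model (it only measures token overlap), and the `threshold` value and the way the question and answer are joined into a hypothesis are assumptions made for the example.

```python
from typing import Callable, List


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model (assumption, not a real model).

    Returns the fraction of hypothesis tokens that also appear in the premise.
    A real system would return an entailment probability instead.
    """
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for tok in hypothesis_tokens if tok in premise_tokens)
    return hits / len(hypothesis_tokens)


def filter_passages(
    question: str,
    answer: str,
    passages: List[str],
    score_fn: Callable[[str, str], float] = toy_entailment_score,
    threshold: float = 0.5,  # assumed cutoff, chosen for illustration only
) -> List[str]:
    """Keep only passages whose score for the question-answer pair passes the threshold."""
    # The retrieved passage is the premise; the question-answer pair is the hypothesis.
    hypothesis = f"{question} {answer}"
    return [p for p in passages if score_fn(p, hypothesis) >= threshold]


passages = ["Paris is the capital of France", "Bananas are yellow"]
kept = filter_passages("capital of France", "Paris", passages)
```

If no passage survives the filter, a system built this way would need some fallback, for example answering without retrieved context; that choice is left open here.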
Findings
Filtering out non-entailing passages prevents performance drops.
Fine-tuning with mixed relevant and irrelevant data improves robustness.
Models trained with 1,000 examples maintain high accuracy with irrelevant contexts.
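The fine-tuning finding relies on a training set that mixes relevant and irrelevant contexts. The sketch below shows one way such a mix could be assembled. It is an illustration under stated assumptions: the field names, the `irrelevant_fraction` parameter, and sampling irrelevant passages from a shared pool are all hypothetical choices, not details taken from the paper.

```python
import random
from typing import Dict, List


def build_mixed_training_set(
    examples: List[Dict[str, str]],
    irrelevant_pool: List[str],
    irrelevant_fraction: float = 0.5,  # assumed mixing ratio, for illustration
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Pair each question with either its own context or a random irrelevant one.

    The gold answer is kept in both cases, so a model trained on the result
    sees examples where it must answer correctly despite an unhelpful context.
    """
    rng = random.Random(seed)  # local RNG keeps the mix reproducible
    mixed = []
    for ex in examples:
        use_irrelevant = rng.random() < irrelevant_fraction
        context = rng.choice(irrelevant_pool) if use_irrelevant else ex["context"]
        mixed.append(
            {
                "question": ex["question"],
                "context": context,
                "answer": ex["answer"],
                "context_is_relevant": not use_irrelevant,
            }
        )
    return mixed


examples = [
    {"question": "What is the capital of France?",
     "context": "Paris is the capital of France.",
     "answer": "Paris"},
    {"question": "Who wrote Hamlet?",
     "context": "Hamlet is a tragedy by William Shakespeare.",
     "answer": "William Shakespeare"},
]
pool = ["Bananas are yellow.", "The Nile flows north."]
all_relevant = build_mixed_training_set(examples, pool, irrelevant_fraction=0.0)
all_irrelevant = build_mixed_training_set(examples, pool, irrelevant_fraction=1.0)
```

Setting `irrelevant_fraction` to 0.0 or 1.0 reproduces the two extremes (only relevant or only irrelevant contexts); values in between give the mixed setting.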
Abstract
Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not. This is particularly important in multi-hop reasoning scenarios, where misuse of irrelevant evidence can lead to cascading errors. However, recent work has shown that retrieval augmentation can sometimes have a negative effect on performance. In this work, we present a thorough analysis on five open-domain question answering benchmarks, characterizing cases when retrieval reduces accuracy. We then propose two methods to mitigate this issue. First, a simple baseline that filters out retrieved passages that do not entail question-answer pairs according to a natural language inference (NLI)…
Peer Reviews
Decision: ICLR 2024 poster
1. The paper is well-written and very easy to follow. (I give 3/4 for presentation only because of note #2 in weaknesses.) 2. The work is highly systematic, starting from first principles and building multiple rich systems for RALM, with well-conducted experiments sprinkled throughout to support all key claims. The results are solid. 3. The multi-hop data generation approach is novel and interesting.
1. If I understand correctly, you use the irrelevant context (e.g., in the single-hop case) to train the LM to answer the question by ignoring the context. Isn't this (almost) the definition of hallucination? The resulting LM will produce information not grounded in any passages. Isn't it better to abstain / request a new query, if the context is irrelevant? 2. More fundamentally, it seems like the take-away message is almost presented as "you should finetune on some examples with irrelevant/di
The core problem of degraded RALM performance due to irrelevant context is very compelling from both practical application and general research perspectives. Also it is useful that, besides providing a "baseline" of sorts, the simpler NLI approach is suitable for applications where fine-tuning is not feasible or greater system modularity is desired for architectural reasons. Overall, the presentation and writing are clear. The variation in the benchmark types (single-hop, explicit and implici
I would have found it helpful to have other overall findings briefly summarized at a slightly higher level for, e.g., a practitioner trying to build an RALM application. Something like: "NLI-filtering can increase robustness to noisy IR, but at the cost of leaving IR gains on the table in some cases due to false negatives. If possible for your setting, fine-tuning the model with intentionally varied IR quality seems to improve robustness without sacrificing performance." Figure 4 and Figure 5 conveyed
s1. This paper presents a thorough analysis of the robustness of RALMs to noisy context, which is fundamentally important to research on LLMs, question answering, and information retrieval. s2. Despite their simplicity, the two proposed methods are meaningful and show positive empirical results.
w1. The use of a filtering module to mitigate contextual noise and filtering the language model with noisy context are two established approaches found in various related works on open-domain question answering and conversational question answering. While there may be variations in specific implementations, they might not be regarded as technical breakthroughs for this problem. This paper should conduct a more comprehensive investigation of related techniques. w2. The experimental study can be
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
