Evaluating the Retrieval Robustness of Large Language Models

Shuyang Cao; Karthik Radhakrishnan; David Rosenberg; Steven Lu; Pengxiang Cheng; Lu Wang; Shiyue Zhang

arXiv:2505.21870·cs.CL·May 29, 2025

Open Access · 3 Reviews

TL;DR

This paper assesses how robust large language models are when using retrieval-augmented generation, revealing high robustness but also limitations in fully leveraging retrieved information across various models and strategies.

Contribution

It introduces a benchmark and metrics to evaluate retrieval robustness in LLMs, providing comprehensive insights into their performance and limitations.

Findings

1. LLMs show surprisingly high retrieval robustness

2. Imperfect robustness limits the full benefit of RAG

3. Different models and strategies exhibit varying robustness levels

Abstract

Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. However, RAG may also degrade performance because of imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order affects results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs…
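
To make the three research questions concrete, here is a minimal Python sketch of how sample-level robustness rates of this kind could be computed. It is an illustration under stated assumptions, not the paper's definitions: the abstract is truncated and gives no formulas. The function names, the per-sample score inputs, and the "not worse than" and "unchanged across orders" criteria are all hypothetical choices made for this example.

```python
from typing import Sequence


def no_degradation_rate(rag: Sequence[float], base: Sequence[float]) -> float:
    """Assumed reading of RQ1: share of samples where the RAG score
    is at least the non-RAG score for the same question."""
    if len(rag) != len(base) or not rag:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(r >= b for r, b in zip(rag, base)) / len(rag)


def size_robustness(scores_by_k: Sequence[Sequence[float]]) -> float:
    """Assumed reading of RQ2: scores_by_k[i][j] is the score of sample j
    at the i-th retrieval size (sizes increasing). Returns the share of
    (sample, size step) pairs where adding documents does not lower the score."""
    steps = [
        larger >= smaller
        for small_row, large_row in zip(scores_by_k, scores_by_k[1:])
        for smaller, larger in zip(small_row, large_row)
    ]
    if not steps:
        raise ValueError("need at least two retrieval sizes and one sample")
    return sum(steps) / len(steps)


def order_robustness(scores_by_order: Sequence[Sequence[float]]) -> float:
    """Assumed reading of RQ3: scores_by_order[i][j] is the score of sample j
    under the i-th document ordering. Returns the share of samples whose
    score is identical under every ordering."""
    per_sample = list(zip(*scores_by_order))  # transpose to one tuple per sample
    if not per_sample:
        raise ValueError("need at least one ordering and one sample")
    return sum(len(set(s)) == 1 for s in per_sample) / len(per_sample)


if __name__ == "__main__":
    # Toy correctness scores (1 = correct, 0 = incorrect); not real results.
    print(no_degradation_rate([1, 0, 1, 1], [1, 1, 0, 1]))
    print(size_robustness([[1, 0, 1], [1, 1, 0]]))
    print(order_robustness([[1, 0, 1], [1, 0, 0]]))
```

All three functions return a fraction between 0 and 1, so models and prompting strategies can be compared on the same scale; how the paper actually aggregates across retrieval sizes and orderings may differ.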

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper proposes a sample-level metric to measure retrieval robustness, which is highly relevant for the practical deployment of RAG systems. The research questions are well-chosen and address core concerns of RAG practitioners. Comprehensive experiments are conducted with 11 modern LLMs and 3 prompting strategies. The work gives insight into how modern LLMs react to external information of varying quality. 2. Detailed experiment design! I really appreciate the analysis shown in Sec. 5.3 "

Weaknesses

1. The experimental scope remains limited to 1) knowledge-intensive open-domain QA tasks and 2) a limited knowledge base for retrieval. These two limitations reduce the practical benefit of the research. Evaluating robustness on more domains and in more varied settings is encouraged. E.g., how would the three metrics behave with Google Search? How would the LLMs behave if the queries were a blend of knowledge-intensive and non-knowledge-intensive ones? The good robustness of models on knowledge-intensive queries with Wikipedi

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1) The paper defines three clear quantitative metrics—No Degradation Rate, Retrieval Size Robustness, and Retrieval Order Robustness—that formalize intuitive aspects of retrieval robustness in RAG systems. 2) It provides a practical benchmark of 1,500 open-domain QA samples using Wikipedia retrievals and two standard retrievers (BM25 and BGE), offering a reproducible setup grounded in real-world retrieval conditions. 3) The experimental coverage is broad, evaluating 11 LLMs and three prompting

Weaknesses

1) The novelty is limited. The core idea is primarily a large-scale empirical study rather than a methodological or theoretical innovation. The proposed metrics formalize well-known intuitions but do not introduce new techniques or insights into why robustness varies. 2) The benchmark focuses only on open-domain QA with Wikipedia, which restricts generalization to specialized domains or other RAG use cases such as reasoning, dialogue, or summarization. 3) The evaluation relies heavily on one j

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The most significant strength of this paper is its exhaustive experimental design. The authors systematically investigate the robustness of 11 mainstream LLMs under a natural retrieval setting using the Wikipedia corpus. The study covers a wide range of variables and the experiments are thorough. Compared to previous benchmark work that focused on "artificially" constructing various types of adversarial noise, this paper's setup (strong LLM + natural top-K retrieval) helps us, in the current

Weaknesses

1. I think there is a lack of rigor in contribution positioning and over-claiming in this paper: - The authors' core argument is that "previous benchmark work used a large amount of artificially synthesized/constructed noise data, and is therefore not realistic". However, this argument ignores that the "real internet" retrieval environment is itself flooded with a large amount of noise, errors, and misinformation. From this perspective, previous work (e.g., inserting counterfactual noise) ca

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior