RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems

Yixiao Zeng; Tianyu Cao; Danqing Wang; Xinran Zhao; Zimeng Qiu; Morteza Ziyadi; Tongshuang Wu; Lei Li

arXiv:2506.00789·cs.CL·October 29, 2025

TL;DR

This paper introduces RARE, a framework and benchmark for evaluating the robustness of retrieval-augmented generation systems against real-world noise and dynamic data changes, and shows that these systems are sensitive to query and document perturbations.

Contribution

The paper presents RARE, a novel unified evaluation framework with a large-scale, dynamic benchmark and automated question generation pipeline for assessing RAG systems' robustness.

Findings

01

RAG systems are surprisingly sensitive to perturbations.

02

Robustness is lower on multi-hop queries across domains.

03

The benchmark reveals significant vulnerabilities in current RAG systems.

Abstract

Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal knowledge and externally retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single-hop and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves…
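To make the idea of knowledge-graph-driven multi-hop synthesis concrete, the following is a minimal sketch of how two-hop question candidates could be derived by chaining triples that share an entity. It is not the RARE-Get implementation: the function names (`two_hop_paths`, `path_to_question`), the question template, and the example triples are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A triple is (head entity, relation, tail entity). Hypothetical schema.
Triple = Tuple[str, str, str]


def two_hop_paths(triples: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Chain triples whose tail entity is the head of another triple."""
    by_head: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        by_head[triple[0]].append(triple)

    paths = []
    for first in triples:
        for second in by_head.get(first[2], []):
            # Skip trivial cycles that lead straight back to the start entity.
            if second[2] != first[0]:
                paths.append((first, second))
    return paths


def path_to_question(path: Tuple[Triple, Triple]) -> str:
    """Render a two-hop path with a fixed template (an assumption)."""
    (head, rel1, _bridge), (_bridge2, rel2, _answer) = path
    return f"What is the {rel2} of the {rel1} of {head}?"


# Made-up example triples, not taken from RARE-Set.
triples = [
    ("AcmeCorp", "ceo", "Jane Doe"),
    ("Jane Doe", "alma_mater", "Example University"),
    ("AcmeCorp", "headquarters", "Springfield"),
]
paths = two_hop_paths(triples)
question = path_to_question(paths[0])
answer = paths[0][1][2]
```

Here the bridge entity ("Jane Doe") never appears in the question, so answering it requires combining both facts. A real pipeline would also need relation normalization and quality filtering, which this sketch omits.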

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- This paper identifies key limitations in existing RAG evaluation datasets and introduces a dynamic data generation pipeline that leverages knowledge graphs to automatically construct both single-hop and multi-hop questions. The proposed pipeline has strong potential value for the community, as it can generate diverse and complex datasets from unstructured data. In addition, the proposed dataset **RARE-Set** and the robustness evaluation metric contribute useful resources for assessing RAG syst

Weaknesses

Although the paper includes experiments with various LLMs, the baseline methods are limited to standard RAG setups. Recent state-of-the-art RAG variants, such as adaptive or noise-robust models, are not included, which makes it difficult to fully assess the effectiveness of the proposed benchmark and evaluation metric. Incorporating results from these stronger baselines would significantly enhance the rigor and credibility of the evaluation.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The proposed method is highly scalable: RARE-Get uses KG-driven synthesis to automatically generate multi-hop questions without manual curation, focusing on specialized, time-sensitive corpora appropriate for real-world RAG applications.
- This paper proposes well-defined metrics: RARE-Met distinguishes memorization from retrieval-based reasoning and tests robustness across query perturbations (typos, paraphrasing) and document perturbations (lexical/answer variations). Moreover, its evaluatio
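As an illustration of the typo-style query perturbation the reviewer mentions, here is a minimal sketch that swaps adjacent letters at a configurable rate. The function name `add_typos`, the swap strategy, and the `rate` and `seed` parameters are assumptions made for this example; the paper's actual perturbation procedure may differ.

```python
import random


def add_typos(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letter pairs with probability `rate` (hypothetical scheme)."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # do not move the same character twice
        else:
            i += 1
    return "".join(chars)


original = "What was the policy rate in 2024?"
perturbed = add_typos(original, rate=0.3, seed=42)
```

Only letters are swapped, so digits such as the year stay intact. A robustness check would then compare a system's answers on `original` and `perturbed`.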

Weaknesses

- The paper does not analyze error propagation, which harms its reliability. RARE-Get chains multiple LLMs (GPT-4.1 for extraction and generation, Claude variants for filtering and evaluation), where errors compound at each stage. The paper omits failure rates, data discarded during quality checks, and how extraction errors corrupt downstream question generation.
- The paper assumes generated questions require multi-hop reasoning without validation; questions may be answerable through single-chunk

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The authors introduce Retrieval-Aware Robustness Evaluation (RARE), a comprehensive framework designed for evaluating retrieval-augmented generation systems.
2. Extensive experiments across financial, economic, and policy domains with over 48,000 queries demonstrate that RARE effectively reveals critical limitations in RAG systems, particularly in multi-hop and domain-specific scenarios.
3. The analysis dimensions are thorough and insightful. The paper not only presents overall performance

Weaknesses

1. Insufficient technical details in the knowledge graph construction process: The paper does not adequately describe the core steps of knowledge graph construction, particularly the specific implementation of relation normalization (detailed process using E5-Mistral-7B-Instruct), the quality control mechanisms for handling conflicting or incorrect extracted triples, and how entity alignment and conflict resolution are managed during cross-document knowledge graph merging.
2. The comp

Code & Models

Repositories

leililab/rare · Official

Taxonomy

Methods: Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention · Attention Is All You Need