TL;DR
This paper introduces RARE, a framework and benchmark for evaluating how robust retrieval-augmented generation (RAG) systems are to real-world noise and changing data. It finds that these systems are sensitive to query and document perturbations.
Contribution
The paper presents RARE, a novel unified evaluation framework with a large-scale, dynamic benchmark and automated question generation pipeline for assessing RAG systems' robustness.
Findings
RAG systems are surprisingly sensitive to query and document perturbations.
Robustness is lower on multi-hop queries, and this holds across domains.
The benchmark reveals significant vulnerabilities in current RAG systems.
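To make "sensitivity to perturbations" concrete, here is a minimal sketch of a typo-style query perturbation and a simple robustness rate. It is an illustration only, not the paper's metric: `add_typo`, `robustness_rate`, the exact-match comparison, and the `answer_fn` callable (a stand-in for a RAG system) are all assumptions made for this example.

```python
import random

def add_typo(query: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position (a simple typo model)."""
    if len(query) < 2:
        return query
    i = rng.randrange(len(query) - 1)
    chars = list(query)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_rate(answer_fn, query: str, gold: str,
                    n_variants: int = 5, seed: int = 0) -> float:
    """Fraction of perturbed queries for which answer_fn still returns the gold answer.

    answer_fn is a hypothetical callable mapping a query string to an answer string.
    Exact match (case-insensitive) is a simplification chosen for this sketch.
    """
    rng = random.Random(seed)
    variants = [add_typo(query, rng) for _ in range(n_variants)]
    hits = sum(answer_fn(v).strip().lower() == gold.strip().lower() for v in variants)
    return hits / n_variants

# Toy usage: a system that ignores the query wording is trivially "robust".
always_right = lambda q: "5.25%"
print(robustness_rate(always_right, "What was the policy rate in 2024?", "5.25%"))
```

A system whose rate drops well below 1.0 under such small edits would be called sensitive in the sense used above.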
Abstract
Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from a customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves…
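The idea of deriving multi-hop questions from knowledge-graph relations can be sketched as follows. This is a toy illustration under stated assumptions, not the RARE-Get implementation: the triple format, the function names `two_hop_paths` and `to_question`, the question template, and the example entities are all hypothetical.

```python
from collections import defaultdict

def two_hop_paths(triples):
    """Chain (head, relation, tail) triples where one triple's tail is another's head.

    Returns a list of (head, rel1, middle, rel2, tail) tuples.
    """
    by_head = defaultdict(list)
    for h, r, t in triples:
        by_head[h].append((r, t))
    paths = []
    for h, r1, mid in triples:
        for r2, t in by_head.get(mid, []):
            if t != h:  # skip trivial cycles back to the starting entity
                paths.append((h, r1, mid, r2, t))
    return paths

def to_question(path):
    """Turn a two-hop path into a templated question and its answer.

    The middle entity is deliberately left out of the question text,
    so answering requires resolving the first hop.
    """
    h, r1, _mid, r2, t = path
    return f"What is the {r2} of the {r1} of {h}?", t

# Made-up triples, for illustration only.
triples = [
    ("Acme Corp", "parent company", "Globex"),
    ("Globex", "headquarters", "Springfield"),
]
for path in two_hop_paths(triples):
    print(to_question(path))
```

A single-hop question would use one triple directly; the two-hop version hides the intermediate entity, which is what makes the question depend on more than one extracted relation.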
Peer Reviews
Decision·Submitted to ICLR 2026
- This paper identifies key limitations in existing RAG evaluation datasets and introduces a dynamic data generation pipeline that leverages knowledge graphs to automatically construct both single-hop and multi-hop questions. The proposed pipeline has strong potential value for the community, as it can generate diverse and complex datasets from unstructured data. In addition, the proposed dataset **RARE-Set** and the robustness evaluation metric contribute useful resources for assessing RAG syst
Although the paper includes experiments with various LLMs, the baseline methods are limited to standard RAG setups. Recent state-of-the-art RAG variants, such as adaptive or noise-robust models, are not included, which makes it difficult to fully assess the effectiveness of the proposed benchmark and evaluation metric. Incorporating results from these stronger baselines would significantly enhance the rigor and credibility of the evaluation.
- The proposed method is highly scalable: RARE-Get uses KG-driven synthesis to automatically generate multi-hop questions without manual curation, focusing on specialized, time-sensitive corpora appropriate for real-world RAG applications.
- This paper proposes well-defined metrics: RARE-Met distinguishes memorization from retrieval-based reasoning and tests robustness across query perturbations (typos, paraphrasing) and document perturbations (lexical/answer variations). Moreover, its evaluatio
- The paper does not analyze error propagation, which undermines the reliability of the pipeline. RARE-Get chains multiple LLMs (GPT-4.1 for extraction and generation, Claude variants for filtering and evaluation), so errors can compound at each stage. The paper omits failure rates, the amount of data discarded during quality checks, and how extraction errors corrupt downstream question generation.
- The paper assumes generated questions require multi-hop reasoning without validation; questions may be answerable through single-chunk
1. The authors introduce Retrieval-Aware Robustness Evaluation (RARE), a comprehensive framework designed for evaluating retrieval-augmented generation systems.
2. Extensive experiments across financial, economic, and policy domains with over 48,000 queries demonstrate that RARE effectively reveals critical limitations in RAG systems, particularly in multi-hop and domain-specific scenarios.
3. The analysis dimensions are thorough and insightful. The paper not only presents overall performance
1. Insufficient technical details in the knowledge graph construction process: The paper does not provide enough description of the core steps of knowledge graph construction, particularly the specific implementation of relation normalization (detailed process using E5-Mistral-7B-Instruct), the quality control mechanisms for handling extracted conflicting or incorrect triples, and how to manage entity alignment and conflict resolution during cross-document knowledge graph merging.
2. The comp
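One reviewer concern above is that supposedly multi-hop questions might be answerable from a single chunk. A rough way to screen for that is sketched below. It is a hypothetical heuristic, not something the paper or the reviewers describe: the function name `single_chunk_shortcut`, the substring matching, and the example chunks are assumptions for illustration.

```python
def single_chunk_shortcut(chunks, start_entity, answer):
    """Return indices of chunks that mention both the question's start entity and its answer.

    A non-empty result suggests the question may be answerable without
    combining evidence across chunks. Case-insensitive substring matching
    is a crude proxy and will miss paraphrases or aliases.
    """
    s, a = start_entity.lower(), answer.lower()
    return [i for i, c in enumerate(chunks) if s in c.lower() and a in c.lower()]

# Made-up chunks, for illustration only.
chunks = [
    "Acme Corp is owned by Globex.",
    "Globex is headquartered in Springfield.",
    "Acme Corp, a Globex unit based in Springfield, reported earnings.",
]
print(single_chunk_shortcut(chunks, "Acme Corp", "Springfield"))  # [2]
```

Here the third chunk alone connects the start entity to the answer, so a question built from the first two chunks would not strictly require two hops over this corpus.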
Taxonomy
Methods: Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention · Attention Is All You Need
