Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Linda Zeng; Rithwik Gupta; Divij Motwani; Yi Zhang; Diji Yang

arXiv:2502.16101 · cs.AI · January 21, 2026

TL;DR

This paper introduces RAGuard, a new benchmark dataset based on Reddit discussions, to evaluate the robustness of retrieval-augmented generation systems against misleading and conflicting evidence in real-world scenarios.

Contribution

It presents the first fact-checking dataset that captures naturally occurring misinformation to assess RAG systems' resilience to misleading retrievals.

Findings

01. All tested RAG systems perform worse than zero-shot baselines under misleading retrievals.
02. Human annotators outperform RAG systems in handling misleading evidence.
03. RAG systems are highly susceptible to noisy and misleading information.
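
The "worse than zero-shot" comparison in the first finding can be made concrete as a difference in accuracy between two runs over the same claims. The following is a minimal sketch, not the authors' evaluation code: the function names `accuracy` and `retrieval_gap`, the verdict labels, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted verdicts that match the gold verdicts."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of claims")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def retrieval_gap(
    zero_shot: Sequence[str],
    with_retrieval: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Accuracy with retrieval minus zero-shot accuracy.

    A negative value means the retrieved documents hurt the model.
    """
    return accuracy(with_retrieval, gold) - accuracy(zero_shot, gold)


# Hypothetical toy data, not taken from the paper.
gold = ["true", "false", "false", "true"]
zero_shot = ["true", "false", "true", "true"]          # 3 of 4 correct
with_misleading = ["false", "false", "true", "true"]   # 2 of 4 correct

gap = retrieval_gap(zero_shot, with_misleading, gold)
print(f"gap = {gap:+.2f}")  # prints "gap = -0.25"
```

Under these assumptions, a negative gap corresponds to the situation the finding describes: the model answers more claims correctly without any retrieved context than it does when given misleading documents.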

Abstract

Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to maintain consistent reasoning when exposed to misleading or conflicting evidence, especially in real-world domains such as politics, where information is polarized or selectively framed. Mainstream RAG benchmarks evaluate models under clean retrieval settings, where systems generate answers from gold-standard documents, or under synthetically perturbed settings, where documents are artificially injected with noise. These assumptions fail to reflect real-world conditions, often leading to an overestimation of RAG system performance. To address this gap, we introduce RAGuard, the first benchmark to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our…

Code & Models

Datasets

UCSC-IRKM/RAGuard
dataset· 155 dl

Videos

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals· slideslive

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals· underline

Taxonomy

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Layer Normalization · Byte Pair Encoding · WordPiece · Dense Connections · Attention Dropout · Residual Connection