Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Yanbo Dai; Zhenlan Ji; Zongjie Li; Kuan Li; Shuai Wang

arXiv:2508.20083·cs.CR·August 28, 2025


TL;DR

This paper introduces DisarmRAG, a novel poisoning method targeting retrievers in retrieval-augmented generation systems to bypass self-correction, achieving high attack success rates while remaining stealthy.

Contribution

The paper proposes a retriever-poisoning paradigm that uses contrastive learning and co-optimization to bypass self-correction in RAG systems, in contrast to prior attacks that poison the knowledge base.

Findings

01. Achieves over 90% attack success rate across multiple LLMs and benchmarks.

02. Successfully bypasses self-correction mechanisms in RAG systems.

03. Remains stealthy against detection methods, highlighting new security challenges.

Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by poisoning the knowledge base to mislead them into generating attacker-chosen outputs. However, this paper finds that such attacks can be mitigated by the strong self-correction ability (SCA) of modern LLMs, which can reject false context when properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromise enables the attacker to straightforwardly embed anti-SCA instructions into…
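For readers unfamiliar with the pipeline the abstract refers to, the following is a minimal sketch of the retrieval step in a generic RAG system. It is not the paper's method or code. The bag-of-words `embed` function, the `VOCAB` list, and the helper names `retrieve` and `build_prompt` are all hypothetical stand-ins chosen for illustration; real systems use a learned dense encoder. The sketch only shows that whatever the retriever ranks highest becomes the context the LLM sees, which is why the retriever is a meaningful point of control.

```python
import math

# Hypothetical toy vocabulary; a real retriever uses a learned encoder instead.
VOCAB = ["capital", "france", "paris", "python", "language"]


def embed(text, vocab=VOCAB):
    """Toy bag-of-words embedding: count of each vocabulary word in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, passages, k=1):
    """Return the k passages whose embeddings are most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=1):
    """Assemble the text handed to the LLM: retrieved context, then the question."""
    context = "\n".join(retrieve(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


PASSAGES = [
    "Paris is the capital of France",
    "Python is a programming language",
]
```

Calling `build_prompt("capital of France", PASSAGES)` places the passage about Paris in the context, because the retriever alone decides which text accompanies the question.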
