Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning
Yanbo Dai, Zhenlan Ji, Zongjie Li, Kuan Li, Shuai Wang

TL;DR
This paper introduces DisarmRAG, a novel poisoning method targeting retrievers in retrieval-augmented generation systems to bypass self-correction, achieving high attack success rates while remaining stealthy.
Contribution
The paper proposes a retriever-poisoning paradigm that uses contrastive learning and co-optimization to bypass self-correction in RAG systems, whereas prior attacks poisoned only the knowledge base.
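This summary does not give the paper's training objective, so the following is only a generic illustration of what contrastive learning on a retriever's embeddings usually means. It is a minimal sketch, assuming an InfoNCE-style loss over cosine similarities. The function names, the temperature value, and the toy vectors are all hypothetical and not taken from DisarmRAG.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's stated objective).

    The loss is small when the query embedding is closer to the positive
    passage than to any of the negative passages.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate passages.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability of picking the positive passage.
    return log_denominator - logits[0]

# Toy 2-D embeddings, purely illustrative.
query = [1.0, 0.0]
aligned = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.2]]

loss_good = info_nce(query, aligned, unrelated)
loss_bad = info_nce(query, unrelated[0], [aligned, unrelated[1]])
print(loss_good < loss_bad)
```

Minimizing a loss of this shape pulls the chosen query and passage together in embedding space and pushes the other passages away, which is the general mechanism by which a fine-tuned retriever can be made to rank a specific passage first.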
Findings
Achieves over 90% attack success rate across multiple LLMs and benchmarks.
Successfully bypasses self-correction mechanisms in RAG systems.
Remains stealthy against detection methods, highlighting new security challenges.
Abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for improving the reliability of large language models (LLMs). Prior work demonstrates the vulnerability of RAG systems by misleading them into generating attacker-chosen outputs through poisoning the knowledge base. However, this paper uncovers that such attacks could be mitigated by the strong self-correction ability (SCA) of modern LLMs, which can reject false context once properly configured. This SCA poses a significant challenge for attackers aiming to manipulate RAG systems. In contrast to previous poisoning methods, which primarily target the knowledge base, we introduce DisarmRAG, a new poisoning paradigm that compromises the retriever itself to suppress the SCA and enforce attacker-chosen outputs. This compromise enables the attacker to straightforwardly embed anti-SCA instructions into…
