FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models
Zhuo Chen, Yuyang Gong, Jiawei Liu, Miaokun Chen, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu

TL;DR
This paper introduces FlippedRAG, a transfer-based black-box adversarial attack method that manipulates retrieval-augmented generation models to alter opinions on controversial topics, revealing significant security vulnerabilities.
Contribution
We develop a novel attack framework that reverse-engineers the retriever and crafts poisoning triggers, demonstrating substantial effectiveness against black-box RAG models.
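As a rough illustration of what imitating a black-box retriever can involve, the sketch below turns an observed ranking into pairwise preference triples that a surrogate ranking model could be trained on. This summary does not describe the paper's actual procedure, so the function name `ranking_to_pairs`, the triple format, and the pairwise formulation itself are assumptions made for illustration only.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(query: str, ranked_docs: List[str]) -> List[Tuple[str, str, str]]:
    """Convert one observed ranking into (query, preferred, less_preferred) triples.

    Hypothetical helper: `ranked_docs` is assumed to be the document order
    returned by the black-box system for `query`, best first.
    """
    # combinations() preserves input order, so the first element of each pair
    # is always the document that was ranked higher.
    return [(query, higher, lower) for higher, lower in combinations(ranked_docs, 2)]


# Example: a ranking of three documents yields three preference pairs.
pairs = ranking_to_pairs("q", ["d1", "d2", "d3"])
for triple in pairs:
    print(triple)
```

Triples of this kind are a common input for pairwise learning-to-rank objectives; whether FlippedRAG uses this exact form is not stated here.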
Findings
FlippedRAG increases the attack success rate by 16.7%.
It causes a 50% shift in the opinion polarity of generated responses.
Existing defenses are ineffective against FlippedRAG.
Abstract
Retrieval-Augmented Generation (RAG) enriches LLMs by dynamically retrieving external knowledge, reducing hallucinations and satisfying real-time information needs. While existing research mainly targets RAG's performance and efficiency, emerging studies highlight critical security concerns. Yet current adversarial approaches remain limited, mostly addressing white-box scenarios or heuristic black-box attacks without fully investigating vulnerabilities in the retrieval phase. Additionally, prior works mainly focus on factoid Q&A tasks; their attacks lack complexity and can be easily corrected by advanced LLMs. In this paper, we investigate a more realistic and critical threat scenario: adversarial attacks intended for opinion manipulation against black-box RAG models, particularly on controversial topics. Specifically, we propose FlippedRAG, a transfer-based adversarial attack against…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
Methods: Focus · Layer Normalization · Dense Connections · Attention Dropout · Softmax · Byte Pair Encoding · Linear Warmup With Linear Decay · WordPiece · Linear Layer
