TL;DR
CleanBase detects malicious documents in RAG knowledge bases by identifying clusters of highly similar documents, which helps prevent prompt injection attacks.
Contribution
CleanBase introduces a similarity-graph approach that flags malicious documents by their high semantic similarity to one another and the cliques they form in the graph. The approach is validated both theoretically and empirically.
Findings
CleanBase accurately detects malicious documents across multiple datasets.
The method effectively prevents prompt injection attacks in RAG systems.
Theoretical bounds on false positive and false negative rates are established.
Abstract
Retrieval-augmented generation (RAG) is vulnerable to prompt injection attacks, in which an adversary inserts malicious documents containing carefully crafted injected prompts into the knowledge database. When a user issues a question targeted by the attack, the RAG system may retrieve these malicious documents, whose injected prompts mislead it into generating attacker-specified answers, thereby compromising the integrity of the RAG system. In this work, we propose CleanBase, a method to detect malicious documents within a knowledge database. Our key insight is that malicious documents crafted for the same attack-targeted questions often exhibit high semantic similarity, as attackers deliberately make them consistent to improve attack success rates. Accordingly, CleanBase constructs a similarity graph over the knowledge database, where each node represents a document and an edge…
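The similarity-graph idea can be illustrated with a small sketch. The abstract is cut off before it defines the edges, so the sketch assumes an edge joins two documents whose embedding cosine similarity reaches a threshold, and it flags every document that belongs to a sufficiently large clique. The function names, the threshold of 0.95, the minimum clique size of 3, and the toy embeddings are all hypothetical choices for illustration, not the authors' implementation or settings.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_similarity_graph(embeddings, threshold):
    """Adjacency sets: one node per document, an edge when similarity >= threshold.

    The edge rule is an assumption; the source abstract is truncated here.
    """
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


def flag_suspicious(embeddings, threshold=0.95, min_clique_size=3):
    """Return sorted indices of documents in any clique of at least min_clique_size.

    Both default values are hypothetical, chosen only for this toy example.
    """
    adj = build_similarity_graph(embeddings, threshold)
    flagged = set()
    for clique in maximal_cliques(adj):
        if len(clique) >= min_clique_size:
            flagged.update(clique)
    return sorted(flagged)


# Toy data: documents 0-2 are near-duplicates, documents 3-5 are unrelated.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.98, 0.0, 0.05],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.7],
]
print(flag_suspicious(toy_embeddings))  # [0, 1, 2]
```

In a real knowledge base the embeddings would come from a text encoder, and exhaustive pairwise comparison plus clique enumeration would be too slow at scale; the sketch is only meant to make the node, edge, and clique vocabulary concrete.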
