Adversarial Decoding: Generating Readable Documents for Adversarial Objectives
Collin Zhang, Tingwei Zhang, Vitaly Shmatikov

TL;DR
This paper introduces adversarial decoding, a generic text generation method that produces readable adversarial documents capable of evading defensive filters and influencing retrieval-augmented generation (RAG) systems, and that outperforms existing techniques on these objectives.
Contribution
It presents a novel, generic decoding approach that handles complex adversarial objectives, including embedding similarity, and produces more effective, readable adversarial texts than prior methods.
Findings
Outperforms existing methods in producing readable adversarial documents
Effective against RAG poisoning, jailbreaking, and filter evasion
Generates documents that influence retrieval and subsequent generation processes
Abstract
We design, implement, and evaluate adversarial decoding, a new, generic text generation technique that produces readable documents for different adversarial objectives. Prior methods either produce easily detectable gibberish, or cannot handle objectives that include embedding similarity. In particular, they only work for direct attacks (such as jailbreaking) and cannot produce adversarial text for realistic indirect injection, e.g., documents that (1) are retrieved in RAG systems in response to broad classes of queries, and also (2) adversarially influence subsequent generation. We also show that fluency (low perplexity) is not sufficient to evade filtering. We measure the effectiveness of adversarial decoding for different objectives, including RAG poisoning, jailbreaking, and evasion of defensive filters, and demonstrate that it outperforms existing methods while producing readable…
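The abstract equates fluency with low perplexity. As a rough illustration of what a perplexity-based filter measures, here is a minimal sketch that assumes a toy unigram model with add-one smoothing. The function names, the reference corpus, and the threshold are hypothetical, chosen for illustration only; this is not the paper's filter or its language model.

```python
import math
from collections import Counter


def unigram_perplexity(text, reference_corpus):
    """Perplexity of `text` under an add-one-smoothed unigram model.

    Assumption: whitespace tokenization and a unigram model stand in
    for the real language model a deployed filter would use.
    """
    ref_tokens = reference_corpus.lower().split()
    counts = Counter(ref_tokens)
    vocab_size = len(counts) + 1  # one extra slot for unseen tokens
    total = len(ref_tokens)

    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")

    log_prob = 0.0
    for tok in tokens:
        # Add-one smoothing keeps unseen tokens from having zero probability.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)

    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(tokens))


def passes_perplexity_filter(text, reference_corpus, threshold):
    """Hypothetical filter: accept text whose perplexity is at most `threshold`."""
    return unigram_perplexity(text, reference_corpus) <= threshold


corpus = "the cat sat on the mat the dog sat on the rug"
fluent = "the cat sat on the rug"
gibberish = "zxq vbn qqq lkj"

print(unigram_perplexity(fluent, corpus) < unigram_perplexity(gibberish, corpus))
```

A filter of this shape rejects token-level gibberish but accepts any text that scores as fluent, which is consistent with the abstract's point that low perplexity alone does not guarantee a document evades filtering.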
Taxonomy
Topics: Digital Media Forensic Detection
Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Attention Dropout · Linear Layer · Layer Normalization · Linear Warmup With Linear Decay · Residual Connection · WordPiece
