How Not to Detect Prompt Injections with an LLM
Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha

TL;DR
This paper critically analyzes a defense mechanism against prompt injection attacks in LLMs, revealing a fundamental vulnerability and demonstrating an effective adaptive attack that bypasses existing defenses.
Contribution
It formally characterizes the KAD defense scheme, uncovers its structural vulnerability, and introduces DataFlip, an adaptive attack that reliably evades KAD defenses.
Findings
KAD scheme has a structural vulnerability.
DataFlip evades KAD with 0% detection rate.
DataFlip successfully induces malicious behavior in 91% of cases.
Abstract
LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as while reliably inducing malicious behavior with a success rate of , all without requiring white-box access to the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSecurity and Verification in Computing · Adversarial Robustness in Machine Learning · Cryptographic Implementations and Security
