How Not to Detect Prompt Injections with an LLM

Sarthak Choudhary; Divyam Anshumaan; Nils Palumbo; Somesh Jha

arXiv:2507.05630·cs.CR·December 9, 2025

How Not to Detect Prompt Injections with an LLM

Sarthak Choudhary, Divyam Anshumaan, Nils Palumbo, Somesh Jha

PDF

Open Access

TL;DR

This paper critically analyzes a defense mechanism against prompt injection attacks in LLMs, revealing a fundamental vulnerability and demonstrating an effective adaptive attack that bypasses existing defenses.

Contribution

It formally characterizes the KAD defense scheme, uncovers its structural vulnerability, and introduces DataFlip, an adaptive attack that reliably evades KAD defenses.

Findings

01

KAD scheme has a structural vulnerability.

02

DataFlip evades KAD with 0% detection rate.

03

DataFlip successfully induces malicious behavior in 91% of cases.

Abstract

LLM-integrated applications and agents are vulnerable to prompt injection attacks, where adversaries embed malicious instructions within seemingly benign input data to manipulate the LLM's intended behavior. Recent defenses based on known-answer detection (KAD) scheme have reported near-perfect performance by observing an LLM's output to classify input data as clean or contaminated. KAD attempts to repurpose the very susceptibility to prompt injection as a defensive mechanism. We formally characterize the KAD scheme and uncover a structural vulnerability that invalidates its core security premise. To exploit this fundamental vulnerability, we methodically design an adaptive attack, DataFlip. It consistently evades KAD defenses, achieving detection rates as low as $0%$ while reliably inducing malicious behavior with a success rate of $91%$ , all without requiring white-box access to the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSecurity and Verification in Computing · Adversarial Robustness in Machine Learning · Cryptographic Implementations and Security