The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?
Manish Bhatt, Sarthak Munshi, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Joel Webb, Blake Gatto, Md Tamjidul Hoque

TL;DR
This paper proves fundamental limits on wrapper defenses against prompt injection for language models, establishing a trilemma: a defense cannot be simultaneously continuous, utility-preserving, and complete.
Contribution
It introduces a formal framework showing that continuous, utility-preserving defense wrappers must fail somewhere, extends the results to discrete and multi-turn settings, and verifies the theory mechanically in Lean 4 and empirically on three LLMs.
Findings
No continuous, utility-preserving wrapper can make all outputs strictly safe.
Under a transversality condition, a positive-measure subset of inputs remains strictly unsafe.
The results are validated both mechanically in Lean 4 and empirically on three LLMs.
Abstract
We prove that no continuous, utility-preserving wrapper defense (a function that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an ε-robust constraint (under Lipschitz regularity, a positive-measure band around fixed boundary points remains near-threshold); and a persistent unsafe region (under a transversality condition, a positive-measure subset of inputs remains strictly unsafe). These constitute a defense trilemma: continuity, utility preservation, and completeness cannot coexist. We prove parallel discrete results requiring no topology, and extend to multi-turn interactions,…
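The boundary-fixation claim can be illustrated with a one-dimensional toy model. The sketch below is not the paper's construction: the prompt space [0, 1], the score function f, the threshold TAU, and both wrappers are hypothetical choices made only for illustration, and "safe" is assumed to mean f(x) < TAU. A wrapper that is the identity on safe inputs and continuous everywhere must map the threshold point to itself, so that point never becomes strictly safe. The only way to move it is to introduce a jump.

```python
TAU = 0.5  # hypothetical safety threshold (assumption, not from the paper)


def f(x: float) -> float:
    """Toy continuous unsafety score on the prompt space [0, 1]."""
    return x  # an input is treated as safe iff f(x) < TAU


def continuous_defense(x: float) -> float:
    """Identity on safe inputs; pulls unsafe inputs back below the threshold.

    Both branches agree at x == TAU, so the wrapper is continuous,
    and the threshold point is left unchanged.
    """
    if f(x) < TAU:
        return x
    return TAU - 0.5 * (x - TAU)


def jumpy_defense(x: float) -> float:
    """Identity on safe inputs; sends everything else to 0.

    This makes every output strictly safe, but only by being
    discontinuous at the threshold.
    """
    return x if f(x) < TAU else 0.0


# Utility preservation: safe inputs pass through unchanged.
print(continuous_defense(0.2))
# Unsafe inputs are repaired to a strictly safe score.
print(f(continuous_defense(0.9)) < TAU)
# Boundary fixation: the threshold point is fixed, so its score is not < TAU.
print(continuous_defense(TAU) == TAU)
# The complete wrapper pays with a jump of about TAU at the boundary.
print(abs(jumpy_defense(TAU - 1e-9) - jumpy_defense(TAU)))
```

In this toy setting the continuous wrapper keeps two properties of the trilemma (continuity, utility preservation) and loses completeness at x = TAU, while the jumpy wrapper keeps utility preservation and completeness and loses continuity.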
