A Critical Evaluation of Defenses against Prompt Injection Attacks
Yuqi Jia, Zedian Shao, Yupei Liu, Jinyuan Jia, Dawn Song, Neil Zhenqiang Gong

TL;DR
This paper critically evaluates existing defenses against prompt injection attacks on LLMs, revealing that many are less effective than previously claimed when assessed with a comprehensive, principled methodology.
Contribution
It introduces a rigorous evaluation framework for defenses against prompt injection, highlighting gaps in prior assessments and guiding future defense development.
Findings
Existing defenses are less effective than previously claimed.
Many defenses compromise LLM utility under evaluation.
The paper proposes a comprehensive evaluation methodology.
Abstract
Large Language Models (LLMs) are vulnerable to prompt injection attacks, and several defenses have recently been proposed, often claiming to mitigate these attacks successfully. However, we argue that existing studies lack a principled approach to evaluating these defenses. In this paper, we argue the need to assess defenses across two critical dimensions: (1) effectiveness, measured against both existing and adaptive prompt injection attacks involving diverse target and injected prompts, and (2) general-purpose utility, ensuring that the defense does not compromise the foundational capabilities of the LLM. Our critical evaluation reveals that prior studies have not followed such a comprehensive evaluation methodology. When assessed using this principled approach, we show that existing defenses are not as successful as previously reported. This work provides a foundation for evaluating…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Malware Detection Techniques
