Defenses Against Prompt Attacks Learn Surface Heuristics
Shawn Li, Chenxiao Yu, Zhiyu Ni, Hao Li, Charith Peris, Chaowei Xiao, Yue Zhao

TL;DR
This paper reveals that current defenses against prompt attacks in large language models rely on superficial surface heuristics, leading to high false rejection rates and poor generalization, and introduces diagnostic datasets to evaluate these limitations.
Contribution
The study systematically analyzes shortcut behaviors in defenses and provides diagnostic datasets to evaluate the robustness of prompt attack defenses in LLMs.
Findings
Position bias causes high rejection rates for benign content placed later in prompts.
Trigger tokens increase false refusals significantly.
Defended models show poor generalization, with accuracy drops up to 40%.
Abstract
Large language models (LLMs) are increasingly deployed in security-sensitive applications, where they must follow system- or developer-specified instructions that define the intended task behavior, while completing benign user requests. When adversarial instructions appear in user queries or externally retrieved content, models may override intended logic. Recent defenses rely on supervised fine-tuning with benign and malicious labels. Although these methods achieve high attack rejection rates, we find that they rely on narrow correlations in defense data rather than harmful intent, leading to systematic rejection of safe inputs. We analyze three recurring shortcut behaviors induced by defense fine-tuning. \emph{Position bias} arises when benign content placed later in a prompt is rejected at much higher rates; across reasoning benchmarks, suffix-task rejection rises from below…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)
