Defenses Against Prompt Attacks Learn Surface Heuristics

Shawn Li; Chenxiao Yu; Zhiyu Ni; Hao Li; Charith Peris; Chaowei Xiao; Yue Zhao

arXiv:2601.07185 · cs.CR · January 13, 2026

TL;DR

This paper reveals that current defenses against prompt attacks in large language models rely on superficial surface heuristics, leading to high false rejection rates and poor generalization, and introduces diagnostic datasets to evaluate these limitations.

Contribution

The study systematically analyzes shortcut behaviors in defenses and provides diagnostic datasets to evaluate the robustness of prompt attack defenses in LLMs.

Findings

01

Position bias causes high rejection rates for benign content placed later in prompts.

02

Trigger tokens increase false refusals significantly.

03

Defended models generalize poorly, with accuracy drops of up to 40%.
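
The first two findings are stated in terms of how often benign inputs are refused. As a minimal sketch of how such a comparison could be computed, the snippet below measures the false rejection rate per condition (for example, benign task placed as a prefix versus as a suffix). The function name `false_rejection_rates`, the record layout, and the toy records are hypothetical and for illustration only; they are not the paper's data or evaluation code.

```python
from collections import defaultdict


def false_rejection_rates(records):
    """Return the fraction of benign inputs rejected, per condition.

    records: iterable of (condition, is_benign, was_rejected) tuples.
    Malicious inputs are skipped, since rejecting them is correct.
    """
    benign = defaultdict(int)
    rejected = defaultdict(int)
    for condition, is_benign, was_rejected in records:
        if not is_benign:
            continue
        benign[condition] += 1
        if was_rejected:
            rejected[condition] += 1
    return {c: rejected[c] / benign[c] for c in benign}


# Hypothetical toy records, not results from the paper.
toy_records = [
    ("prefix", True, False),
    ("prefix", True, False),
    ("prefix", True, True),
    ("prefix", True, False),
    ("suffix", True, True),
    ("suffix", True, True),
    ("suffix", True, False),
    ("suffix", True, True),
    ("suffix", False, True),  # malicious input, excluded from the rate
]

rates = false_rejection_rates(toy_records)
print(rates)
```

The same function would apply to a trigger-token comparison by using conditions such as "with trigger" and "without trigger" on otherwise identical benign inputs.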

Abstract

Large language models (LLMs) are increasingly deployed in security-sensitive applications, where they must follow system- or developer-specified instructions that define the intended task behavior, while completing benign user requests. When adversarial instructions appear in user queries or externally retrieved content, models may override intended logic. Recent defenses rely on supervised fine-tuning with benign and malicious labels. Although these methods achieve high attack rejection rates, we find that they rely on narrow correlations in defense data rather than harmful intent, leading to systematic rejection of safe inputs. We analyze three recurring shortcut behaviors induced by defense fine-tuning. Position bias arises when benign content placed later in a prompt is rejected at much higher rates; across reasoning benchmarks, suffix-task rejection rises from below…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)