AlignSentinel: Alignment-Aware Detection of Prompt Injection Attacks
Yuqi Jia, Ruiqi Wang, Xilong Wang, Chong Xiang, Neil Gong

TL;DR
AlignSentinel is a novel three-class classifier that detects prompt injection attacks by considering instruction alignment, significantly improving detection accuracy over existing methods that misclassify benign aligned instructions as malicious.
Contribution
This work introduces AlignSentinel, the first classifier to distinguish among aligned, misaligned, and non-instruction inputs, utilizing attention-based features and a new benchmark dataset.
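As a rough illustration of what a three-class classifier over attention-based features could look like, the sketch below pairs a hypothetical feature map with plain softmax regression. It is not the authors' architecture or feature definition: `attention_features`, `synthetic_features`, `train_softmax`, `predict`, the label order in `LABELS`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

# Assumed label order for illustration only.
LABELS = ("non_instruction", "aligned", "misaligned")


def attention_features(attn, input_span, task_span):
    """Hypothetical feature map (not the paper's definition).

    attn has shape (layers, heads, seq, seq) with each row summing to 1.
    For every (layer, head) pair, return the average attention mass that
    tokens in input_span place on tokens in task_span. Spans are slices.
    """
    block = attn[:, :, input_span, task_span]  # (layers, heads, n_in, n_task)
    return block.sum(axis=-1).mean(axis=-1).reshape(-1)


def synthetic_features(n_per_class=50, dim=4, seed=0):
    """Synthetic stand-in features: one noisy cluster per class (dim >= 3)."""
    rng = np.random.default_rng(seed)
    centers = np.eye(3, dim)
    X = np.concatenate(
        [c + 0.1 * rng.standard_normal((n_per_class, dim)) for c in centers]
    )
    y = np.repeat(np.arange(3), n_per_class)
    return X, y


def train_softmax(X, y, n_classes=3, lr=0.5, epochs=300):
    """Multinomial logistic regression fitted by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b


def predict(W, b, X):
    """Return the index into LABELS with the highest logit for each row."""
    return np.argmax(X @ W + b, axis=1)


if __name__ == "__main__":
    X, y = synthetic_features()
    W, b = train_softmax(X, y)
    accuracy = (predict(W, b, X) == y).mean()
    print("training accuracy on synthetic data:", accuracy)
    print("first prediction:", LABELS[predict(W, b, X[:1])[0]])
```

In a real pipeline the feature vectors would come from an LLM's attention maps rather than synthetic clusters, and the choice of layers, heads, token spans, and classifier would follow the paper.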
Findings
AlignSentinel accurately detects misaligned instructions.
It outperforms existing detection baselines.
The benchmark includes diverse input categories.
Abstract
Prompt injection attacks insert malicious instructions into an LLM's input to steer it toward an attacker-chosen task instead of the intended one. Existing detection defenses typically classify any input containing an instruction as malicious, which leads them to misclassify benign inputs whose instructions align with the intended task. In this work, we account for the instruction hierarchy and distinguish among three categories: inputs with misaligned instructions, inputs with aligned instructions, and non-instruction inputs. We introduce AlignSentinel, a three-class classifier that leverages features derived from an LLM's attention maps to categorize inputs accordingly. To support evaluation, we construct the first systematic benchmark containing inputs from all three categories. Experiments on both our benchmark and existing ones--where inputs with aligned instructions are largely…
Peer Reviews
Decision: Submitted to ICLR 2026
- Incorporating an aligned-instruction category into the training of the defense is a natural way to reduce false positives on benign inputs.
- Internal signals such as attention are potentially strong indicators of malicious input, as shown by prior work [1, 2].
- AlignSentinel shows strong results in the authors' experiments (however, I have concerns regarding reliability; see Weaknesses).

[1] Hung et al. Attention Tracker: Detecting prompt injection attacks in LLMs. 2024.
[2] Choudhary et al. T
- The reliability of the exceedingly strong results (near-perfect FPR and FNR) in the authors' experiments is unclear. In particular, it is unclear whether these strong results are due directly to the ability of AlignSentinel to distinguish unaligned instructions from benign inputs, as suggested by the authors. The authors use the same pipeline to produce training examples for the classifier and (with a domain shift) for the test set of attacks and benign samples, which may en
- Novel problem formulation that addresses a critical limitation of existing prompt injection detectors by explicitly accounting for instruction hierarchy, reducing false positives on benign but instruction-containing inputs.
- Strong theoretical motivation for using attention maps as detection signals
- New benchmark which will be useful for the community
- Thorough evaluation
- Proposes two variants and explores their differences
- I would appreciate more discussion of limitations/potential weaknesses. Do the authors think AlignSentinel is easy to beat? Could an attacker trick the model's idea of task alignment?
- I think the paper would be stronger if the authors discussed more about why the baselines seem especially weak. It makes sense that AlignSentinel is much stronger, since it was developed in response to this alignment problem, but some of the baselines seem almost mistrained given their extremely high FNR.
- It would help to
- The explicit distinction between aligned and misaligned instructions is convincing and addresses an obvious gap in existing work on instruction-data separation. This contribution is timely and already relevant for current agentic systems that, for example, generate execution plans which should be treated as aligned instrumental goals rather than hijacking attempts (in comparison to the original goal).
- The authors clearly define what they mean by aligned instructions, misaligned instructions
- It would be valuable if the authors made an explicit connection to the instruction-data separation literature [1, 2, 3, 4] and integrated the notion of data into their model (which currently appears to be framed as "non-instructional inputs").
- In lines 37-45, the authors informally introduce aligned and misaligned instructions using an email agent example. They argue that some instructions from fetched emails should be considered aligned instructions. However, this seems like a clear instru
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Cryptographic Implementations and Security
