AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks
Weiming Song, Xuan Xie, Ruiping Yin

TL;DR
AISA is a lightweight, single-pass method that enhances large language model safety by detecting and steering away from risky prompts using intrinsic signals, without retraining or external guardrails.
Contribution
AISA introduces a novel intrinsic safety detection and steering mechanism that operates in a single pass, avoiding costly fine-tuning or external safety modules.
Findings
Achieves detector-level performance on small models
Improves safety robustness across multiple datasets and models
Reduces false refusals while maintaining utility
Abstract
Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, while many existing defenses rely on expensive fine-tuning, intrusive prompt rewriting, or external guardrails that add latency and can degrade helpfulness. We present AISA, a lightweight, single-pass defense that activates safety behaviors already latent inside the model rather than treating safety as an add-on. AISA first localizes intrinsic safety awareness via spatiotemporal analysis and shows that intent-discriminative signals are broadly encoded, with especially strong separability appearing in the scaled dot-product outputs of specific attention heads near the final structural tokens before generation. Using a compact set of automatically selected heads, AISA extracts an interpretable prompt-risk score with minimal overhead, achieving detector-level performance…
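To make the detection idea in the abstract concrete, here is a minimal sketch of how a prompt-risk score could be read off a small set of attention heads. It is not the authors' implementation: the Fisher-style separability criterion, the mean-difference projection, the function names (`head_separability`, `fit_risk_scorer`, `risk_score`), and the synthetic features are all assumptions made for illustration. In a real setting, `feats` would hold each head's scaled dot-product output at the last prompt token, captured during the normal forward pass.

```python
import numpy as np

def head_separability(feats, labels):
    """Fisher-style score per head (an assumed criterion, not the paper's).

    feats:  (n_prompts, n_heads, d_head) head outputs at the last prompt token.
    labels: (n_prompts,) with 1 = harmful, 0 = benign.
    """
    pos, neg = feats[labels == 1], feats[labels == 0]
    gap = np.linalg.norm(pos.mean(0) - neg.mean(0), axis=-1)    # (n_heads,)
    spread = pos.std(0).mean(-1) + neg.std(0).mean(-1) + 1e-8   # (n_heads,)
    return gap / spread

def fit_risk_scorer(feats, labels, k=4):
    """Select the k most separable heads and a mean-difference direction for each."""
    heads = np.argsort(head_separability(feats, labels))[-k:]
    pos = feats[labels == 1][:, heads]
    neg = feats[labels == 0][:, heads]
    direction = pos.mean(0) - neg.mean(0)          # (k, d_head)
    midpoint = 0.5 * (pos.mean(0) + neg.mean(0))   # (k, d_head)
    return heads, direction, midpoint

def risk_score(feats, heads, direction, midpoint):
    """Project the selected heads onto their directions and squash to (0, 1)."""
    proj = ((feats[:, heads] - midpoint) * direction).sum(-1)   # (n_prompts, k)
    return 1.0 / (1.0 + np.exp(-proj.mean(-1)))

# Synthetic stand-in for real head outputs: 200 prompts, 16 heads, 8 dims per head.
rng = np.random.default_rng(0)
labels = np.arange(200) % 2
feats = rng.normal(size=(200, 16, 8))
feats[labels == 1, 3] += 2.0   # plant an intent signal in head 3
feats[labels == 1, 7] += 2.0   # and in head 7

heads, direction, midpoint = fit_risk_scorer(feats, labels, k=2)
scores = risk_score(feats, heads, direction, midpoint)
```

Because the score is computed from activations that the forward pass already produces, a scorer of this shape adds only a few small projections per prompt, which is consistent with the single-pass, low-overhead framing above. The scores here are evaluated on the same synthetic data used for fitting, so they say nothing about held-out performance.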
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Advanced Malware Detection Techniques
