ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification
Xiao Lin, Philip Li, Zhichen Zeng, Tingwei Li, Tianxin Wei, Xuying Ning, Gaotang Li, Yuzhong Chen, Hanghang Tong

TL;DR
This paper introduces ALERT, a novel zero-shot jailbreak detection method for large language models that amplifies internal discrepancies to identify safety violations without prior attack templates.
Contribution
The paper proposes a new amplification framework and ALERT detector that effectively identifies zero-shot jailbreaks by leveraging internal feature discrepancies in LLMs.
Findings
ALERT outperforms existing methods across multiple benchmarks.
It achieves at least 10% higher average accuracy and F1-score than existing methods.
ALERT reliably ranks among the top two detection methods.
Abstract
Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly rely on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios, where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and…
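As a rough illustration of what a layer-wise discrepancy measurement could look like, the sketch below scores each layer by how far apart two groups of prompts sit in feature space. It is not the authors' method: the scoring rule (distance between group means divided by within-group spread), the function names, and the synthetic Gaussian "hidden states" are all assumptions made for this example, and the excerpt does not say how ALERT actually selects layers, modules, or tokens.

```python
import math
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def spread(vectors, center):
    """Root-mean-square distance of the vectors from their center."""
    total = sum(sum((x - c) ** 2 for x, c in zip(v, center)) for v in vectors)
    return math.sqrt(total / len(vectors))


def layer_discrepancy(group_a, group_b, eps=1e-8):
    """Per-layer separation score between two groups of prompts.

    Each group is a list of prompts; each prompt is a list of per-layer
    feature vectors. The score is the distance between the group means
    divided by the average within-group spread (an illustrative choice,
    not taken from the paper).
    """
    num_layers = len(group_a[0])
    scores = []
    for layer in range(num_layers):
        a = [prompt[layer] for prompt in group_a]
        b = [prompt[layer] for prompt in group_b]
        mu_a, mu_b = mean_vector(a), mean_vector(b)
        gap = math.dist(mu_a, mu_b)
        within = 0.5 * (spread(a, mu_a) + spread(b, mu_b))
        scores.append(gap / (within + eps))
    return scores


def synthetic_prompts(n, shifts, dim, rng):
    """Hypothetical stand-in for hidden states: Gaussian noise plus a per-layer shift."""
    return [[[rng.gauss(shift, 1.0) for _ in range(dim)] for shift in shifts]
            for _ in range(n)]


rng = random.Random(0)
# Toy data: the two groups differ only at layer index 2.
group_a = synthetic_prompts(50, [0.0, 0.0, 0.0, 0.0], dim=8, rng=rng)
group_b = synthetic_prompts(50, [0.0, 0.0, 3.0, 0.0], dim=8, rng=rng)

scores = layer_discrepancy(group_a, group_b)
best_layer = max(range(len(scores)), key=scores.__getitem__)
```

With real models, the per-layer vectors would come from the model's internal activations rather than from `synthetic_prompts`; the same scoring loop could then be repeated per module or per token position, which is one plausible reading of "progressively magnifies".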
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Occupational Health and Safety Research
