ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin; Philip Li; Zhichen Zeng; Tingwei Li; Tianxin Wei; Xuying Ning; Gaotang Li; Yuzhong Chen; Hanghang Tong

arXiv:2601.03600·cs.LG·January 8, 2026

TL;DR

This paper introduces ALERT, a zero-shot jailbreak detector for large language models (LLMs) that amplifies internal feature discrepancies between benign and jailbreak prompts, so it needs no jailbreak templates during training.

Contribution

The paper proposes a layer-wise, module-wise, and token-wise amplification framework and the ALERT detector, which identifies zero-shot jailbreaks from internal feature discrepancies in LLMs.

Findings

01. ALERT outperforms existing methods across multiple benchmarks.

02. On average, ALERT achieves at least 10% higher accuracy and F1-score than those methods.

03. ALERT reliably ranks among the top two detection methods.

Abstract

Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly rely on jailbreak templates present in the training data to detect jailbreaks. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and…
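
The abstract does not say how feature discrepancies are measured, so the following is only a rough illustration of the general idea of comparing internal features layer by layer. It is not ALERT's algorithm. The function names (`layer_discrepancy`, `make_synthetic_group`), the nested-list data layout, the Euclidean distance between group means, and the synthetic data are all assumptions made for this sketch.

```python
import math
import random


def layer_discrepancy(group_a, group_b):
    """Per-layer Euclidean distance between the mean feature vectors of two groups.

    Each group is a list of prompts; each prompt is a list of layers;
    each layer is a list of floats (one feature vector per layer).
    This metric is an illustrative assumption, not the paper's definition.
    """
    n_layers = len(group_a[0])
    dim = len(group_a[0][0])
    scores = []
    for layer in range(n_layers):
        mean_a = [sum(p[layer][k] for p in group_a) / len(group_a) for k in range(dim)]
        mean_b = [sum(p[layer][k] for p in group_b) / len(group_b) for k in range(dim)]
        scores.append(math.dist(mean_a, mean_b))
    return scores


def make_synthetic_group(n_prompts, n_layers, dim, shifted_layer=None, shift=0.0, seed=0):
    """Random stand-in features; optionally shift one layer to mimic a group difference."""
    rng = random.Random(seed)
    group = []
    for _ in range(n_prompts):
        prompt = []
        for layer in range(n_layers):
            offset = shift if layer == shifted_layer else 0.0
            prompt.append([rng.gauss(offset, 1.0) for _ in range(dim)])
        group.append(prompt)
    return group


# Synthetic data only: group B is constructed to differ from group A at layer 2.
group_a = make_synthetic_group(50, n_layers=4, dim=8, seed=1)
group_b = make_synthetic_group(50, n_layers=4, dim=8, shifted_layer=2, shift=3.0, seed=2)

scores = layer_discrepancy(group_a, group_b)
most_discriminative_layer = max(range(len(scores)), key=scores.__getitem__)
```

In this toy setup the layer with the largest score is the one where the two groups were constructed to differ. With a real model, the features would come from the model's hidden states rather than a random generator, and the choice of layers, modules, and tokens would follow the paper itself.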

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Occupational Health and Safety Research