Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models

Shuang Liang; Zhihao Xu; Jiaqi Weng; Jialing Tao; Hui Xue; Xiting Wang

arXiv:2508.09201 · cs.CR · January 28, 2026

Open Access

TL;DR

This paper introduces LoD (Learning to Detect), a learnable framework that detects unseen jailbreak attacks on large vision-language models from the model's internal representations. It needs no attack data or hand-crafted heuristics and reports state-of-the-art AUROC across unseen attacks.

Contribution

LoD is presented as the first learnable, attack-agnostic detection framework that uses internal activations to identify unseen jailbreak attacks in LVLMs, improving both detection accuracy and efficiency.

Findings

01 Achieves state-of-the-art AUROC across various unseen attacks

02 Operates without attack data or hand-crafted heuristics

03 Significantly improves detection efficiency
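
For readers unfamiliar with the metric in the first finding: AUROC can be read as the probability that a randomly chosen attack input receives a higher anomaly score than a randomly chosen benign input. The sketch below computes it by direct pairwise comparison. The function name `auroc` and the score values are illustrative assumptions, not data or code from the paper.

```python
import numpy as np

def auroc(scores_benign, scores_attack):
    """Fraction of (benign, attack) pairs where the attack scores higher; ties count as 0.5."""
    b = np.asarray(scores_benign, dtype=float)[:, None]  # shape (n_benign, 1)
    a = np.asarray(scores_attack, dtype=float)[None, :]  # shape (1, n_attack)
    wins = (a > b).sum() + 0.5 * (a == b).sum()
    return wins / (b.size * a.size)

# Made-up anomaly scores, purely for illustration.
benign = [0.1, 0.2, 0.35, 0.4]
attack = [0.3, 0.8, 0.9]
print(auroc(benign, attack))  # 10 of the 12 pairs are ordered correctly
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, independent of any particular decision threshold.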

Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. Detection methods are essential for mitigating these risks, yet existing ones face two major challenges: generalization and accuracy. Learning-based methods trained on specific attacks fail to generalize to unseen attacks, while learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD first extracts layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vector classifiers, and then converts these high-dimensional representations into a one-dimensional anomaly score for detection…
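
The two-stage idea in the abstract (per-layer safety representations, then a single anomaly score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the activations are synthetic, each layer's classifier is a plain logistic-regression probe trained on safe versus plainly unsafe inputs, and the anomaly score is a Mahalanobis distance fitted on safe inputs only, used here as a simple stand-in because the abstract does not specify how the score is produced. All names (`fake_activations`, `fit_probe`, `safety_vector`, `anomaly_score`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM = 6, 16  # hypothetical model sizes

def fake_activations(n, shift):
    """Synthetic stand-in for per-layer hidden states, shape (n, N_LAYERS, DIM)."""
    x = rng.normal(size=(n, N_LAYERS, DIM))
    x[:, :, 0] += shift  # 'unsafe' content moves activations along one direction
    return x

def fit_probe(x, y, steps=300, lr=0.5):
    """Logistic-regression probe for one layer, trained by gradient descent."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * x.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def safety_vector(acts, probes):
    """Map activations (n, N_LAYERS, DIM) to per-layer unsafe probabilities (n, N_LAYERS)."""
    cols = [1.0 / (1.0 + np.exp(-(acts[:, l] @ w + b)))
            for l, (w, b) in enumerate(probes)]
    return np.stack(cols, axis=1)

# Stage 1: one probe per layer, trained on safe vs. unsafe inputs (no attack data).
x = np.concatenate([fake_activations(200, 0.0), fake_activations(200, 2.0)])
y = np.concatenate([np.zeros(200), np.ones(200)])
probes = [fit_probe(x[:, l], y) for l in range(N_LAYERS)]

# Stage 2 (assumed stand-in): model the safe inputs' layer-wise vectors as a Gaussian.
ref = safety_vector(fake_activations(200, 0.0), probes)
mu = ref.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(ref, rowvar=False) + 1e-3 * np.eye(N_LAYERS))

def anomaly_score(acts):
    """One-dimensional score: Mahalanobis distance from the safe reference distribution."""
    d = safety_vector(acts, probes) - mu
    return np.sqrt(np.einsum("ni,ij,nj->n", d, cov_inv, d))

benign_scores = anomaly_score(fake_activations(100, 0.0))
shifted_scores = anomaly_score(fake_activations(100, 1.5))  # inputs unlike the safe reference
print(benign_scores.mean(), shifted_scores.mean())
```

The point of the second stage is that the detector is fitted only on what safe inputs look like, so any input whose layer-wise pattern deviates from that reference gets a high score, whether or not a similar attack was seen before.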

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications