Learning to Detect Unseen Jailbreak Attacks in Large Vision-Language Models
Shuang Liang, Zhihao Xu, Jiaqi Weng, Jialing Tao, Hui Xue, Xiting Wang

TL;DR
This paper introduces LoD, a novel learnable framework that effectively detects unseen jailbreak attacks in large vision-language models by leveraging internal model representations, achieving state-of-the-art performance without attack data or heuristics.
Contribution
LoD is the first learnable, attack-agnostic detection framework that uses internal activations for identifying unseen jailbreak attacks in LVLMs, improving accuracy and efficiency.
Findings
Achieves state-of-the-art AUROC across various unseen attacks
Operates without attack data or hand-crafted heuristics
Significantly improves detection efficiency
Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks. Detection methods are essential for mitigating these risks, yet existing ones face two major challenges: generalization and accuracy. Learning-based methods trained on specific attacks fail to generalize to unseen attacks, while learning-free methods based on hand-crafted heuristics suffer from limited accuracy and reduced efficiency. To address these limitations, we propose Learning to Detect (LoD), a learnable framework that eliminates the need for any attack data or hand-crafted heuristics. LoD first extracts layer-wise safety representations directly from the model's internal activations using Multi-modal Safety Concept Activation Vectors classifiers, and then converts these high-dimensional representations into a one-dimensional anomaly score for detection…
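The two-stage idea in the abstract (layer-wise safety classifiers, then a single anomaly score) can be illustrated with a minimal sketch on synthetic data. Everything below is an assumption for illustration and is not the authors' implementation: the activations are random vectors rather than real LVLM hidden states, each per-layer classifier is a plain logistic-regression probe trained on benign versus plainly harmful inputs, and the scoring step uses a Mahalanobis distance to benign statistics as a stand-in, because the abstract is truncated before it describes the actual scoring model. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, DIM = 4, 16  # hypothetical sizes, far smaller than a real LVLM


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe by gradient descent; returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        grad = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def layer_scores(acts, probes):
    """acts: (n, layers, dim) -> (n, layers) per-layer 'unsafe' probabilities."""
    cols = [sigmoid(acts[:, l] @ w + b) for l, (w, b) in enumerate(probes)]
    return np.stack(cols, axis=1)


def fit_benign_stats(scores):
    """Mean and regularized inverse covariance of benign layer-score vectors."""
    mu = scores.mean(axis=0)
    cov = np.cov(scores, rowvar=False) + 1e-3 * np.eye(scores.shape[1])
    return mu, np.linalg.inv(cov)


def anomaly_score(scores, mu, prec):
    """Mahalanobis distance to the benign distribution (stand-in scorer)."""
    diff = scores - mu
    return np.sqrt(np.einsum("ni,ij,nj->n", diff, prec, diff))


# Synthetic stand-in for hidden states: one "unsafe" direction per layer.
directions = rng.normal(size=(N_LAYERS, DIM))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)


def sample(n, shift):
    """Draw n activation tensors displaced by `shift` along each direction."""
    return rng.normal(size=(n, N_LAYERS, DIM)) + shift * directions


# Stage 1: probes trained on benign vs. plainly harmful inputs only.
safe_train, unsafe_train = sample(300, 0.0), sample(300, 3.0)
X = np.concatenate([safe_train, unsafe_train])
y = np.concatenate([np.zeros(300), np.ones(300)])
probes = [fit_linear_probe(X[:, l], y) for l in range(N_LAYERS)]

# Stage 2: summarize benign layer scores, then score held-out inputs.
mu, prec = fit_benign_stats(layer_scores(safe_train, probes))
benign_scores = anomaly_score(layer_scores(sample(200, 0.0), probes), mu, prec)
# "Unseen attack": inputs lying between the safe and unsafe clusters.
attack_scores = anomaly_score(layer_scores(sample(200, 1.5), probes), mu, prec)

print("mean benign score:", benign_scores.mean())
print("mean attack score:", attack_scores.mean())
```

In this toy setup no attack samples are used for fitting; the simulated attack inputs only appear at scoring time, which mirrors the attack-agnostic setting the abstract describes. A threshold on the resulting score, chosen on benign data, would turn it into a detector.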
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications
