Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
Peichun Hua, Hao Li, Shanghao Shi, Zhiyuan Yu, Ning Zhang

TL;DR
This paper introduces Representational Contrastive Scoring (RCS), a novel method for detecting multimodal jailbreak attacks on large vision-language models by analyzing their internal representations.
Contribution
It proposes RCS, a lightweight, interpretable framework that improves the generalization and reliability of jailbreak detection by applying contrastive scoring to the model's internal representations.
Findings
RCS outperforms existing methods on unseen attack types.
MCD and KCD, the paper's two instantiations of RCS, achieve state-of-the-art detection performance.
The approach is efficient and practical for real-world deployment.
Abstract
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse unseen benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate…
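The contrast the abstract draws between one-class anomaly detection and contrastive scoring can be illustrated with a small sketch. This is not the authors' implementation: the features are synthetic Gaussians standing in for LVLM hidden states, no projection is learned, Mahalanobis distance is chosen only for illustration, and every name (`fit_gaussian`, `contrastive_score`, the "safety" and "style" axes) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # assumed feature size; real hidden states are far larger


def sample(n, safety_shift, style_shift):
    """Synthetic features: axis 0 mimics a safety direction, axis 1 a benign style shift."""
    x = rng.normal(size=(n, dim))
    x[:, 0] += safety_shift
    x[:, 1] += style_shift
    return x


def fit_gaussian(feats):
    """Mean and regularized inverse covariance of row-vector features."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
    return mu, np.linalg.inv(cov)


def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance of each row of x to the fitted Gaussian."""
    d = x - mu
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)


# Reference sets available when the detector is built.
benign_stats = fit_gaussian(sample(500, 0.0, 0.0))
malicious_stats = fit_gaussian(sample(500, 4.0, 0.0))


def contrastive_score(x):
    """Positive when x lies closer to the malicious reference than to the benign one."""
    return mahalanobis_sq(x, *benign_stats) - mahalanobis_sq(x, *malicious_stats)


# Test-time inputs with a style shift that neither reference set contains.
novel_benign = sample(200, 0.0, 6.0)
novel_malicious = sample(200, 4.0, 6.0)

# One-class baseline: flag anything far from the benign reference.
benign_ref_scores = mahalanobis_sq(sample(500, 0.0, 0.0), *benign_stats)
threshold = np.percentile(benign_ref_scores, 95)
one_class_flag_rate = (mahalanobis_sq(novel_benign, *benign_stats) > threshold).mean()

# Contrastive scoring: the shared novelty largely cancels in the difference.
contrastive_benign_flag_rate = (contrastive_score(novel_benign) > 0).mean()
contrastive_malicious_flag_rate = (contrastive_score(novel_malicious) > 0).mean()

print("one-class, novel benign flagged:", one_class_flag_rate)
print("contrastive, novel benign flagged:", contrastive_benign_flag_rate)
print("contrastive, novel malicious flagged:", contrastive_malicious_flag_rate)
```

In this toy setting the one-class detector rejects most novel benign inputs because they are merely unfamiliar, while the contrastive score stays low for them: the unfamiliar shift adds roughly the same amount to both distances and drops out of the difference. Real representations will not separate this cleanly, which is presumably where the learned projection described in the abstract comes in.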
