Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Zhihua Wei, Qiang Li, Jian Ruan, Zhenxin Qin, Leilei Wen, Dongrui Liu, Wen Shen

TL;DR
This paper investigates how visual inputs cause large vision-language models to shift representations toward jailbreak states, leading to safety failures, and proposes a defense method that removes this shift to improve safety.
Contribution
The paper introduces the concept of jailbreak-related representation shift, identifies a specific shift direction, and proposes a defense method to mitigate jailbreaks in VLMs.
Findings
Jailbreak samples form a distinct, separable internal state.
The identified jailbreak shift reliably characterizes jailbreak behavior.
The proposed defense method effectively reduces jailbreak success while maintaining benign performance.
Abstract
Large vision-language models (VLMs) often exhibit weakened safety alignment with the integration of the visual modality. Even when text prompts contain explicit harmful intent, adding an image can substantially increase jailbreak success rates. In this paper, we observe that VLMs can clearly distinguish benign inputs from harmful ones in their representation space. Moreover, even among harmful inputs, jailbreak samples form a distinct internal state that is separable from refusal samples. These observations suggest that jailbreaks do not arise from a failure to recognize harmful intent. Instead, the visual modality shifts representations toward a specific jailbreak state, thereby leading to a failure to trigger refusal. To quantify this transition, we identify a jailbreak direction and define the jailbreak-related shift as the component of the image-induced representation shift along…
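The quantity the abstract describes, the component of the image-induced representation shift along a jailbreak direction, can be illustrated with a small NumPy sketch. Everything here is an assumption for illustration: the abstract is truncated and does not say how the direction is estimated, so the sketch uses a difference of class means between jailbreak and refusal representations, and the function names (`jailbreak_direction`, `jailbreak_related_shift`, `remove_shift`) are hypothetical. The removal step is only one plausible reading of "removes this shift" and is not presented as the authors' method.

```python
import numpy as np

def jailbreak_direction(jailbreak_reps, refusal_reps):
    """Unit vector pointing from the refusal mean to the jailbreak mean.

    Assumption: a difference-of-means estimator; the source does not
    specify how the direction is identified.
    """
    d = jailbreak_reps.mean(axis=0) - refusal_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def jailbreak_related_shift(h_text_image, h_text_only, direction):
    """Scalar component of the image-induced shift along the direction."""
    image_shift = h_text_image - h_text_only
    return float(image_shift @ direction)

def remove_shift(h_text_image, h_text_only, direction):
    """Subtract the shift component along the direction (hypothetical defense)."""
    s = jailbreak_related_shift(h_text_image, h_text_only, direction)
    return h_text_image - s * direction

# Toy 2-D representations, made up for illustration only.
jailbreak_reps = np.array([[2.0, 0.0], [4.0, 0.0]])
refusal_reps = np.array([[0.0, 0.0], [2.0, 0.0]])
direction = jailbreak_direction(jailbreak_reps, refusal_reps)

h_text_only = np.array([0.0, 1.0])   # representation of the text prompt alone
h_text_image = np.array([3.0, 5.0])  # representation with the image added

shift = jailbreak_related_shift(h_text_image, h_text_only, direction)
h_defended = remove_shift(h_text_image, h_text_only, direction)
```

In this toy setup the projection keeps only the part of the image-induced change that lies along the estimated direction, and the removal step leaves the orthogonal part of the representation untouched, which is one way a defense could aim to preserve benign behavior.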
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
