Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models
Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng

TL;DR
This paper investigates why safety mechanisms in large language models do not transfer well to vision inputs in vision-language models, analyzes the underlying causes, and proposes a novel text-guided alignment method to improve safety transfer without compromising performance.
Contribution
It introduces TGA, a novel text-guided alignment method that effectively transfers safety mechanisms from text to vision in LVLMs without additional safety fine-tuning.
Findings
TGA successfully transfers safety mechanisms from text to vision in LVLMs.
TGA maintains performance on various vision tasks.
Current alignment methods cause semantic shifts that hinder safety transfer.
Abstract
Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the LLM's existing safety mechanism for text to vision, which leaves LVLMs vulnerable to toxic images. To explore the cause of this problem, we explain where and how the safety mechanism of LVLMs operates and conduct a comparative analysis between text and vision. We find that the hidden states at specific transformer layers play a crucial role in the successful activation of the safety mechanism, while the vision-language alignment at the hidden-state level in current methods is insufficient. This results in a semantic shift for input images compared to text in hidden states, which misleads the safety mechanism. To address this, we propose a novel Text-Guided…
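The semantic shift described in the abstract can be illustrated with a per-layer comparison of hidden states. The following is a minimal sketch, not the paper's procedure: it assumes each input (an image, and a text with the same meaning) has already been reduced to one pooled hidden-state vector per transformer layer, and it uses cosine similarity as one plausible way to compare them. The function names `layerwise_cosine` and `most_shifted_layer` are hypothetical.

```python
import numpy as np


def layerwise_cosine(image_states, text_states, eps=1e-12):
    """Cosine similarity between image and text hidden states, per layer.

    Both inputs are assumed to have shape (num_layers, hidden_dim), i.e. one
    pooled vector per transformer layer (the pooling itself is an assumption
    and is not specified here).
    """
    image_states = np.asarray(image_states, dtype=float)
    text_states = np.asarray(text_states, dtype=float)
    if image_states.ndim != 2 or image_states.shape != text_states.shape:
        raise ValueError("expected two arrays of shape (num_layers, hidden_dim)")
    dots = np.sum(image_states * text_states, axis=1)
    norms = np.linalg.norm(image_states, axis=1) * np.linalg.norm(text_states, axis=1)
    # Guard against division by zero for all-zero vectors.
    return dots / np.maximum(norms, eps)


def most_shifted_layer(image_states, text_states):
    """Index of the layer where image and text states are least similar."""
    return int(np.argmin(layerwise_cosine(image_states, text_states)))


if __name__ == "__main__":
    # Toy data: 3 layers, hidden size 2. Real hidden states would come from
    # a model forward pass, which is outside the scope of this sketch.
    image = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(layerwise_cosine(image, text))
    print(most_shifted_layer(image, text))
```

A low similarity at a given layer would indicate that the image representation has drifted away from the representation of the equivalent text at that depth. Whether that layer matters for safety is a separate question that this sketch does not answer.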
Peer Reviews
Decision: ICLR 2025 Poster
1. The authors first analyze the cause of the failure in cross-modal safety transfer. Based on the analysis, they propose Text-Guided Alignment (TGA) to transfer safety mechanisms from text to vision, addressing key safety issues in LVLMs. The analysis is thorough and the proposed method is novel in general.
2. The paper is well-structured, with clear motivations and systematic explanations of the issues with current vision-language alignment methods.
3. The proposed approach contributes to improving th
1. TGA relies on captions generated by LLaVA-1.5-13B for effective alignment. Inaccurate captions can lead to misalignment between vision and language representations, reducing safety performance. Evaluating the impact of captioning errors and exploring mitigation strategies could add robustness to the approach.
2. The paper does not adequately show how the model handles unsafe compositional inputs. For instance, an image of a wine bottle combined with text like "teach a kid to buy this" represe
This paper is well-motivated and provides a thorough analysis of layer activations to explain the safety misalignment between vision and language. The work has potential value across multiple related fields, particularly in the design of vision-language models and their safety challenges. The method for identifying the layers where the safety mechanism is activated is both reasonable and straightforward, showing effectiveness with a simple approach. The proposed TGA alignment method effectivel
The paper lacks comparisons with other defense methods. Aside from the comparison with the unlearn-FigS defense, the current experimental results are mainly contrasted with the original model. Including comparisons with existing safety defense methods, such as [1-2], would provide stronger evidence of the proposed approach's superiority. The presentation is somewhat redundant. For instance, the content in Figures 2 and 4, as well as Figures 3 and 5, could be combined to avoid repetition. Simila
Pros:
1. The paper tackles an interesting problem which (to the best of my knowledge) isn't very well known in the community. As such, it highlights a potential gap and suggests how to fix new VLMs.
2. The motivation is a bit subtle, and it is important to note that it is mostly relevant for open-source models. In a closed-source model, one could simply run an NSFW classifier on the image input. However, for an open-source model, such an additional component can be easily turned off. As such, a method to ha
Cons:
1. One thing that isn't clear to me is whether an end user could reverse the trained safety filter by instruction tuning on a sample of a toxic dataset. In that case, it would be easy to "jailbreak" the safe model.
2. The authors should include a baseline that works as a direct filter on the image itself to get an upper-bound estimate.
Taxonomy
Topics: Risk and Safety Analysis · Topic Modeling · Advanced Data Processing Techniques
