Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba

TL;DR
This paper traces safety alignment degradation in vision-language models to a representation gap between multi-modal and text-only inputs, and proposes CMRM, an inference-time intervention that significantly improves safety without retraining.
Contribution
The paper introduces Cross-Modality Representation Manipulation (CMRM), a novel inference-time method to recover safety alignment in VLMs by addressing representation gaps caused by vision modality integration.
Findings
CMRM reduces the unsafe response rate of LLaVA-7B on multi-modal input from 61.53% to as low as 3.15%.
The method preserves the linguistic and functional capabilities of VLMs.
Safety alignment can be significantly improved without additional training.
Abstract
The safety alignment of Vision-Language Models (VLMs) is prone to degradation when a vision module is integrated, compared with the LLM backbone alone. We investigate this phenomenon, dubbed "safety alignment degradation" in this paper, and show that the challenge arises from the representation gap that emerges when the vision modality is introduced to VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method for…
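The abstract describes CMRM only as an inference-time representation intervention, and this listing does not give its formula. As a rough illustration of the general idea, and not the authors' implementation, the sketch below assumes the gap can be approximated by the difference between the mean hidden state of multi-modal inputs and the mean hidden state of text-only inputs, and that the intervention subtracts a scaled copy of that difference. The names `estimate_shift`, `pull_toward_text`, and `alpha` are hypothetical, and the hidden states are synthetic vectors rather than activations from a real VLM.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def estimate_shift(multimodal_hidden, text_hidden):
    """Assumed gap estimate: mean multi-modal state minus mean text-only state."""
    mm = mean_vector(multimodal_hidden)
    tx = mean_vector(text_hidden)
    return [m - t for m, t in zip(mm, tx)]


def pull_toward_text(hidden, shift, alpha=1.0):
    """Move one hidden state back toward the text-only region.

    alpha is a hypothetical strength knob: 0 leaves the state unchanged,
    1 removes the full estimated shift.
    """
    return [h - alpha * s for h, s in zip(hidden, shift)]


def gap(a_rows, b_rows):
    """Euclidean distance between the mean vectors of two sets of states."""
    diff = estimate_shift(a_rows, b_rows)
    return sum(d * d for d in diff) ** 0.5


# Synthetic stand-ins for hidden states (not real model activations).
random.seed(0)
dim, n = 8, 200
text_hidden = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
# Simulate the vision modality pushing every dimension away by a constant.
multimodal_hidden = [
    [random.gauss(0.0, 1.0) + 3.0 for _ in range(dim)] for _ in range(n)
]

shift = estimate_shift(multimodal_hidden, text_hidden)
calibrated = [pull_toward_text(h, shift, alpha=1.0) for h in multimodal_hidden]

gap_before = gap(multimodal_hidden, text_hidden)
gap_after = gap(calibrated, text_hidden)
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

In this toy setting the mean gap collapses once the full shift is removed. In a real model, the strength of the shift would presumably have to be balanced against output quality, which is why `alpha` is left as a parameter here.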
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Human-Automation Interaction and Safety
