Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

Qin Liu; Chao Shang; Ling Liu; Nikolaos Pappas; Jie Ma; Neha Anna John; Srikanth Doss; Lluis Marquez; Miguel Ballesteros; Yassine Benajiba

arXiv:2410.09047·cs.CL·October 14, 2024

TL;DR

This paper traces safety alignment degradation in vision-language models to a representation gap between multi-modal and text-only inputs, and proposes CMRM, an inference-time intervention that substantially improves safety without retraining.

Contribution

The paper introduces Cross-Modality Representation Manipulation (CMRM), a novel inference-time method to recover safety alignment in VLMs by addressing representation gaps caused by vision modality integration.

Findings

01

CMRM lowers the unsafe response rate of LLaVA-7B on multi-modal input from 61.53% to 3.15%.

02

The method preserves the linguistic and functional capabilities of VLMs.

03

Safety alignment can be significantly improved without additional training.

Abstract

The safety alignment of Vision-Language Models (VLMs) tends to degrade, relative to their LLM backbones, once the vision module is integrated. We investigate this phenomenon, which we call "safety alignment degradation", and show that it arises from the representation gap that emerges when the vision modality is introduced to VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference-time representation intervention method for…
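The idea of pulling multi-modal representations back toward the text-only distribution can be illustrated with a small sketch. It is not the paper's formulation: it assumes the shift can be estimated as the difference between the mean hidden state of multi-modal inputs and the mean hidden state of their text-only counterparts, and that a scaled copy of this shift is subtracted at inference time. The function names, the strength parameter `alpha`, and the toy vectors are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def estimate_shift(multimodal_states: List[Vector],
                   text_only_states: List[Vector]) -> Vector:
    """Assumed estimator: mean multi-modal state minus mean text-only state."""
    mm_mean = mean_vector(multimodal_states)
    txt_mean = mean_vector(text_only_states)
    return [m - t for m, t in zip(mm_mean, txt_mean)]


def calibrate(hidden: Vector, shift: Vector, alpha: float = 1.0) -> Vector:
    """Move a hidden state back toward the text-only distribution.

    alpha is a hypothetical strength knob: 0.0 leaves the state unchanged,
    1.0 removes the full estimated shift.
    """
    return [h - alpha * s for h, s in zip(hidden, shift)]


# Toy 2-dimensional "hidden states" (made-up numbers, for illustration only).
multimodal = [[2.0, 1.0], [4.0, 3.0]]
text_only = [[1.0, 0.0], [1.0, 2.0]]

shift = estimate_shift(multimodal, text_only)
full = calibrate([3.0, 2.0], shift, alpha=1.0)
half = calibrate([3.0, 2.0], shift, alpha=0.5)
print(shift, full, half)
```

In a real VLM the vectors would be hidden states read from the model at inference time, and the choice of `alpha` would trade off recovered safety against preserved multi-modal capability; both details are left open here.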

Taxonomy

Topics: Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research · Human-Automation Interaction and Safety