Understanding and Rectifying Safety Perception Distortion in VLMs

Xiaohan Zou; Jian Kang; George Kesidis; Lu Lin

arXiv:2502.13095 · cs.CV · February 19, 2025

TL;DR

This paper investigates why vision-language models (VLMs) overestimate the safety of harmful inputs and proposes ShiftDC, a training-free method that calibrates the modality-induced activation shift to reduce safety perception distortion, improving safety alignment without harming utility.

Contribution

The paper identifies modality-induced activation shift as the cause of safety perception distortion and introduces ShiftDC, a novel calibration method that restores safety alignment in VLMs.

Findings

01. ShiftDC effectively reduces safety perception distortion.
02. ShiftDC improves safety benchmark performance.
03. ShiftDC maintains vision-language capabilities.

Abstract

Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC…
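
The idea of isolating and removing the safety-relevant part of an activation shift can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how the safety direction is estimated or which layers are calibrated, so the sketch assumes a difference-of-means direction between benign and harmful text activations and a simple projection removal. All function names, variable names, and toy vectors below are hypothetical, and plain Python lists stand in for real hidden states.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit(v):
    """Scale v to unit length; reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def safety_direction(benign_acts, harmful_acts):
    """Assumed estimator: unit vector pointing from the mean harmful
    activation toward the mean benign ("safer") activation."""
    mean_benign = mean_vector(benign_acts)
    mean_harmful = mean_vector(harmful_acts)
    return unit([b - h for b, h in zip(mean_benign, mean_harmful)])

def calibrate(h_multimodal, h_text, direction):
    """Remove the component of the modality-induced shift that lies
    along the safety direction, keeping the rest of the shift."""
    shift = [m - t for m, t in zip(h_multimodal, h_text)]
    coeff = sum(s * d for s, d in zip(shift, direction))
    return [m - coeff * d for m, d in zip(h_multimodal, direction)]

# Toy 3-dimensional activations (hypothetical values).
benign = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
harmful = [[-1.0, 0.0, 0.0], [-3.0, 0.0, 0.0]]
d = safety_direction(benign, harmful)

h_text = [-2.0, 1.0, 0.5]  # text-only activation of a harmful request
h_mm = [0.0, 1.5, 0.2]     # same request with an image: drifted "safer"
h_cal = calibrate(h_mm, h_text, d)

# After calibration the residual shift has no component along d.
residual = [c - t for c, t in zip(h_cal, h_text)]
print(sum(r * x for r, x in zip(residual, d)))
```

In this toy setup the calibrated activation keeps the part of the shift orthogonal to the assumed safety direction, which is where one would expect the visual content to live, while its position along the safety direction matches the text-only activation.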
