Understanding and Rectifying Safety Perception Distortion in VLMs
Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin

TL;DR
This paper investigates why vision-language models overestimate the safety of harmful inputs and proposes a training-free method, ShiftDC, that calibrates activations to reduce this safety perception distortion, improving safety alignment without harming utility.
Contribution
The paper identifies modality-induced activation shift as the cause of safety perception distortion and introduces ShiftDC, a novel calibration method that restores safety alignment in VLMs.
Findings
ShiftDC effectively reduces safety perception distortion.
ShiftDC improves safety benchmark performance.
ShiftDC maintains vision-language capabilities.
Abstract
Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety. By isolating and removing the safety-relevant component, ShiftDC…
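The idea of isolating and removing the safety-relevant component of the activation shift can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated and does not give the exact procedure. The sketch assumes that the "safer" direction can be estimated as the difference between mean activations of benign and harmful text-only inputs, and that calibration subtracts the projection of the shift onto that direction. All function names and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_direction(harmful_acts: Sequence[Sequence[float]],
                     benign_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit vector from the harmful mean toward the benign ("safer") mean.

    Assumption: a difference of means is used as the safety direction.
    """
    diff = [b - h for b, h in zip(mean_vector(benign_acts),
                                  mean_vector(harmful_acts))]
    norm = dot(diff, diff) ** 0.5
    if norm == 0.0:
        raise ValueError("harmful and benign means coincide")
    return [d / norm for d in diff]


def calibrate(vl_act: Sequence[float],
              text_act: Sequence[float],
              direction: Sequence[float]) -> Vector:
    """Remove the safety-relevant part of the modality-induced shift.

    shift = vl_act - text_act is split into its projection onto the
    safety direction and a remainder. Only the projection is subtracted
    from vl_act, so the remainder of the shift is left untouched.
    """
    shift = [v - t for v, t in zip(vl_act, text_act)]
    coeff = dot(shift, direction)  # length of the shift along the direction
    return [v - coeff * d for v, d in zip(vl_act, direction)]


# Toy 2-D example: axis 0 plays the role of "safety", axis 1 of "modality".
harmful = [[-1.0, 0.0], [-1.0, 2.0]]
benign = [[1.0, 0.0], [1.0, 2.0]]
direction = safety_direction(harmful, benign)

text_act = [-1.0, 0.5]  # harmful request, text only
vl_act = [0.5, 1.5]     # same request with an image, shifted toward "safe"
calibrated = calibrate(vl_act, text_act, direction)
```

In this toy setup the calibrated activation matches the text-only activation along the safety axis while keeping the image-induced change along the other axis. In a real model the activations would be hidden states taken from chosen layers, which this sketch does not cover.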
Taxonomy
Topics: Elevator Systems and Control
