Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction
Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao, Yanyan Zhao, Yutai Hou, Qianchao Wang, Dandan Tu, Bing Qin

TL;DR
This paper identifies a safety gap in multimodal large language models caused by modality-induced drift, and proposes a training-free correction method called ReGap that enhances safety without sacrificing utility.
Contribution
It introduces the concept of Safety Geometry Collapse, analyzes its causes, and develops ReGap, a novel inference-time correction technique for improving multimodal safety.
Findings
Counteracting modality-induced drift restores refusal separability.
ReGap significantly improves performance on safety benchmarks.
Self-rectification enables the model to recognize harmful inputs during the forward pass.
Abstract
Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the…
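
To make the geometric vocabulary concrete, the following is a minimal sketch of how a refusal direction, a modality-induced drift direction, and a fixed-strength counteracting intervention might be computed. It is not the paper's implementation: the difference-of-means estimator, the function names (`mean_difference_direction`, `projection_gap`, `counteract_drift`), the strength `alpha`, and the synthetic activations are all assumptions made for illustration. The toy data models drift as a pure displacement, so it does not reproduce the compression of refusal separability that the abstract describes.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def mean_difference_direction(pos, neg):
    """Unit vector from the mean of `neg` to the mean of `pos` (assumed estimator)."""
    return unit(pos.mean(axis=0) - neg.mean(axis=0))

def projection_gap(harmful, harmless, direction):
    """Difference between the mean projections of harmful and harmless activations."""
    return float((harmful @ direction).mean() - (harmless @ direction).mean())

def counteract_drift(hidden, drift_dir, alpha):
    """Fixed-strength intervention: shift every activation by alpha against drift_dir."""
    return hidden - alpha * drift_dir

# Synthetic stand-ins for hidden states (hypothetical data, not model activations).
rng = np.random.default_rng(0)
d = 16
axes = np.eye(d)
text_harmless = rng.normal(size=(32, d))
text_harmful = rng.normal(size=(32, d)) + 2.0 * axes[0]  # harmful text offset along axis 0

# Toy assumption: the non-text modality displaces activations by 3 units along axis 1.
mm_harmful = text_harmful + 3.0 * axes[1]
mm_harmless = text_harmless + 3.0 * axes[1]

# Refusal direction estimated from text-only inputs.
refusal_dir = mean_difference_direction(text_harmful, text_harmless)

# Drift direction estimated from multimodal vs. text activations of the same content.
drift_dir = mean_difference_direction(
    np.vstack([mm_harmful, mm_harmless]),
    np.vstack([text_harmful, text_harmless]),
)

# Counteract the drift with a fixed strength chosen to match the toy displacement.
alpha = 3.0
corrected_harmful = counteract_drift(mm_harmful, drift_dir, alpha)

gap_text = projection_gap(text_harmful, text_harmless, refusal_dir)
print("projection gap on text inputs:", gap_text)
print("recovered text activations:", np.allclose(corrected_harmful, text_harmful))
```

In this toy setup the intervention undoes the displacement exactly because the drift is a constant shift with known magnitude. With real activations the drift strength would have to be estimated or adapted per input, which is the setting an adaptive correction is meant to address.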
