SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models
Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian

TL;DR
SafeSteer is a decoding-level defense mechanism for multimodal large language models that enhances safety by detecting and correcting harmful outputs during decoding without fine-tuning.
Contribution
It introduces SafeSteer, a novel decoding-stage safety method that leverages intrinsic safety capabilities and modal semantic alignment to improve MLLMs' safety.
Findings
SafeSteer improves MLLMs' safety by up to 33.40%.
It maintains model helpfulness while reducing harmful outputs.
Image-based attacks are more stealthy, but SafeSteer effectively counters them.
Abstract
Multimodal large language models (MLLMs) are gaining increasing attention. Because their input features are heterogeneous, they face significant challenges in jailbreak defense. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, which limits their ability to address novel attacks and involves performance trade-offs. To address these issues, we explore the inherent safety capabilities of MLLMs and quantify their intrinsic ability to discern harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which…
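The abstract describes the Decoding-Probe only as a lightweight probe that detects harmful output during decoding. As a rough illustration of what such a probe could look like, the sketch below trains a logistic-regression classifier on synthetic vectors standing in for decoder hidden states. This is not the paper's implementation: the function names (`train_probe`, `harm_score`), the dimensionality, the synthetic data, and the choice of logistic regression are all assumptions made for illustration, and the correction step is not shown because the abstract is cut off before describing it.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(states, labels, lr=0.1, steps=200):
    """Fit a logistic-regression probe with batch gradient descent.

    `states` are stand-ins for decoder hidden states (hypothetical data);
    `labels` are 1 for harmful and 0 for harmless.
    """
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def harm_score(state, w, b):
    """Probability-like score in [0, 1]; higher means 'looks harmful'."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, state)) + b)


# Synthetic hidden states: two Gaussian clusters (an assumption, not real model data).
rng = random.Random(0)
dim = 8
harmless = [[rng.gauss(-0.75, 1.0) for _ in range(dim)] for _ in range(100)]
harmful = [[rng.gauss(0.75, 1.0) for _ in range(dim)] for _ in range(100)]
states = harmless + harmful
labels = [0] * 100 + [1] * 100

w, b = train_probe(states, labels)
predictions = [harm_score(x, w, b) > 0.5 for x in states]
accuracy = sum(p == bool(y) for p, y in zip(predictions, labels)) / len(labels)
print(f"probe accuracy on synthetic states: {accuracy:.2f}")
```

In a real setting the vectors would presumably come from the model's hidden states at each decoding step, and a score above some threshold would trigger an intervention; both the threshold of 0.5 and that usage pattern are assumptions here.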
