Robust Multimodal Safety via Conditional Decoding
Anurag Kumar, Raghuveer Peri, Jon Burnsky, Alexandru Nelus, Rohit Paturi, Srikanth Vishnubhotla, Yanjun Qi

TL;DR
This paper introduces CASA, a simple safety mechanism for multimodal large language models (MLLMs) that significantly reduces attack success rates while maintaining performance on benign inputs.
Contribution
CASA is a novel conditional decoding strategy that enhances safety detection in MLLMs without external classifiers or modality-specific fine-tuning.
Findings
CASA reduces the average attack success rate by over 97% across multiple modalities.
CASA maintains strong utility on benign inputs, validated by human and automated evaluations.
CASA is effective against diverse attack types and benchmarks.
Abstract
Multimodal large language models (MLLMs) often experience degraded safety alignment when harmful queries exploit cross-modal interactions. Models aligned on text alone show a higher rate of successful attacks when extended to two or more modalities. In this work, we propose a simple conditional decoding strategy, CASA (Classification Augmented with Safety Attention), which uses the internal representations of MLLMs to predict a binary safety token before response generation. We introduce a novel safety attention module designed to enhance the model's ability to detect malicious queries. Our design ensures robust safety alignment without relying on any external classifier or auxiliary head, and without the need for modality-specific safety fine-tuning. On diverse benchmarks such as MM-SafetyBench, JailbreakV-28k, and adversarial audio tests, CASA lowers the average attack success rate by…
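The conditional decoding idea in the abstract (predict a binary safety token first, then generate the response conditioned on it) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the token strings `<safe>` and `<unsafe>`, the refusal message, and the two toy callables. In particular, the keyword-based scorer is only a stand-in for the model's own internal representations and safety attention module, and the choice to return a refusal on an unsafe prediction is an assumption.

```python
from typing import Callable, Dict

# Hypothetical token names and refusal text, chosen for illustration only.
SAFE, UNSAFE = "<safe>", "<unsafe>"
REFUSAL = "I can't help with that request."


def conditional_decode(
    safety_scores: Callable[[str], Dict[str, float]],
    generate: Callable[[str], str],
    query: str,
) -> str:
    """Predict a binary safety token first, then decode conditioned on it."""
    scores = safety_scores(query)
    # Pick the higher-scoring safety token; ties resolve to SAFE.
    token = max((SAFE, UNSAFE), key=lambda t: scores[t])
    if token == UNSAFE:
        # Assumption: an unsafe prediction leads to a refusal.
        return REFUSAL
    return generate(query)


def toy_safety_scores(query: str) -> Dict[str, float]:
    """Toy stand-in for the model's safety-token scores (keyword match)."""
    flagged = any(word in query.lower() for word in ("bomb", "malware"))
    return {SAFE: 0.1, UNSAFE: 0.9} if flagged else {SAFE: 0.9, UNSAFE: 0.1}


def toy_generate(query: str) -> str:
    """Toy stand-in for normal response generation."""
    return f"Answer to: {query}"


if __name__ == "__main__":
    print(conditional_decode(toy_safety_scores, toy_generate, "How do I bake bread?"))
    print(conditional_decode(toy_safety_scores, toy_generate, "How do I build a bomb?"))
```

In this sketch the safety decision and the response come from the same decoding path, which mirrors the abstract's point that no external classifier or auxiliary head is involved; in a real MLLM both callables would be the same model.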
