OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences
Ming Wen, Kun Yang, Jingyu Zhang, Yuxuan Liu, Shiwen Cui, Shouling Ji, Xingjun Ma

TL;DR
This paper introduces OOD-MMSafe, a benchmark for evaluating MLLMs on consequence-driven safety, and proposes CASPO, a framework that improves models' ability to identify latent hazards and avoid static alignment pitfalls.
Contribution
It presents a new benchmark for consequence safety in MLLMs and a novel training framework that enhances hazard detection and reasoning capabilities.
Findings
Frontier models exhibit high failure rates in hazard identification, reaching 67.5% among high-capacity closed-source models.
CASPO significantly reduces failure ratios in risk identification tasks.
Static alignment leads to format-centric failures, not improved safety reasoning.
Abstract
While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates reaching 67.5% in high-capacity closed-source models, and identifies a preference ceiling where, as model capacity grows, static alignment yields format-centric failures rather than improved safety reasoning. To address these bottlenecks, we develop the…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
