OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Ming Wen; Kun Yang; Jingyu Zhang; Yuxuan Liu; Shiwen Cui; Shouling Ji; Xingjun Ma

arXiv:2603.09706 · cs.AI · March 11, 2026

TL;DR

This paper introduces OOD-MMSafe, a benchmark for evaluating MLLMs on consequence-driven safety, and proposes CASPO, a framework that improves models' ability to identify latent hazards and avoid static alignment pitfalls.

Contribution

It presents a new benchmark for consequence safety in MLLMs and a novel training framework that enhances hazard detection and reasoning capabilities.

Findings

01 Frontier models exhibit high failure rates in hazard identification, reaching 67.5% among high-capacity closed-source models.

02 CASPO significantly reduces failure ratios in risk identification tasks.

03 Static alignment leads to format-centric failures rather than improved safety reasoning.

Abstract

While safety alignment for Multimodal Large Language Models (MLLMs) has gained significant attention, current paradigms primarily target malicious intent or situational violations. We propose shifting the safety frontier toward consequence-driven safety, a paradigm essential for the robust deployment of autonomous and embodied agents. To formalize this shift, we introduce OOD-MMSafe, a benchmark comprising 455 curated query-image pairs designed to evaluate a model's ability to identify latent hazards within context-dependent causal chains. Our analysis reveals a pervasive causal blindness among frontier models, with failure rates as high as 67.5% among high-capacity closed-source models, and identifies a preference ceiling where, as model capacity grows, static alignment yields format-centric failures rather than improved safety reasoning. To address these bottlenecks, we develop the…

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques