SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Josue Torres-Fonseca; Naihao Deng; Yinpei Dai; Shane Storks; Yichi Zhang; Rada Mihalcea; Casey Kennington; Joyce Chai

arXiv:2604.19638·cs.AI·April 22, 2026

TL;DR

SafetyALFRED introduces a benchmark for evaluating multimodal large language models' ability to recognize and actively mitigate safety hazards in embodied environments, highlighting the gap between hazard recognition and mitigation.

Contribution

We present SafetyALFRED, a new benchmark with real-world hazards and active safety evaluation, revealing models' limitations in safety-critical embodied planning tasks.

Findings

01

Models recognize hazards well in QA but struggle with mitigation in embodied settings.

02

Static hazard recognition does not ensure effective safety mitigation in physical environments.

03

The benchmark promotes the development of models capable of proactive safety actions.

Abstract

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for…

Code & Models

Repositories

sled-group/SafetyALFRED.git (GitHub)
