TL;DR
SafetyALFRED introduces a benchmark for evaluating multimodal large language models' ability to recognize and actively mitigate safety hazards in embodied environments, highlighting the gap between hazard recognition and mitigation.
Contribution
We present SafetyALFRED, a new benchmark with real-world hazards and active safety evaluation, revealing models' limitations in safety-critical embodied planning tasks.
Findings
Models recognize hazards well in QA but struggle with mitigation in embodied settings.
Static hazard recognition does not ensure effective safety mitigation in embodied environments.
The benchmark is intended to promote the development of models capable of proactive safety actions.
Abstract
Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for…
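The gap described in the abstract can be read as the difference between a recognition rate and a mitigation success rate over the same set of hazards. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the `Episode` record, the `rate` and `alignment_gap` helpers, the placeholder category labels, and the toy outcomes are all hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Episode:
    """One hazard instance (hypothetical record, not the benchmark's schema)."""
    category: str     # placeholder label; the paper's six categories are not listed here
    recognized: bool  # the model identified the hazard in the QA setting
    mitigated: bool   # the model's embodied plan resolved the hazard


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def alignment_gap(episodes: List[Episode]) -> Tuple[float, float, float]:
    """Return (recognition rate, mitigation rate, recognition minus mitigation)."""
    recognition = rate(e.recognized for e in episodes)
    mitigation = rate(e.mitigated for e in episodes)
    return recognition, mitigation, recognition - mitigation


# Toy outcomes for illustration only; these are not results from the paper.
episodes = [
    Episode("category_a", recognized=True, mitigated=False),
    Episode("category_a", recognized=True, mitigated=True),
    Episode("category_b", recognized=True, mitigated=False),
    Episode("category_b", recognized=False, mitigated=False),
]

recognition, mitigation, gap = alignment_gap(episodes)
print(f"recognition={recognition:.2f} mitigation={mitigation:.2f} gap={gap:.2f}")
```

A positive gap means hazards are recognized more often than they are mitigated, which is the pattern the abstract reports. How the paper actually aggregates across models and hazard categories is not specified in this excerpt.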
