IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Xiaoya Lu; Zeren Chen; Xuhao Hu; Yijin Zhou; Weichen Zhang; Dongrui Liu; Lu Sheng; Jing Shao

arXiv:2506.16402·cs.AI·December 8, 2025

TL;DR

IS-Bench is a novel multi-modal benchmark designed to evaluate the interactive safety of VLM-driven embodied agents in household tasks, focusing on their ability to perceive and mitigate emergent risks during task execution.

Contribution

This paper introduces IS-Bench, the first benchmark for assessing interactive safety in embodied agents, including a high-fidelity simulator and a process-oriented evaluation methodology.

Findings

01. Current agents lack interactive safety awareness.
02. Safety-aware Chain-of-Thought improves safety but may reduce task success.
03. Many agents fail to perform risk mitigation actions at critical steps.

Abstract

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether…
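The notion of executing mitigation steps "in the correct procedural order" can be illustrated with a minimal sketch. This is not the IS-Bench evaluation code: the `SafetyRule` type, the `rule_satisfied` and `safe_success` helpers, and the action strings are all hypothetical names chosen for illustration. The sketch assumes a trajectory is a flat list of action strings and that each risk is tied to one triggering action with one mitigation that must come either before or after it.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class SafetyRule:
    """Hypothetical rule: `mitigation` must occur before or after `trigger`."""
    trigger: str
    mitigation: str
    when: Literal["before", "after"]


def rule_satisfied(trajectory: List[str], rule: SafetyRule) -> bool:
    """Check one rule against the order of actions in a trajectory."""
    if rule.trigger not in trajectory:
        # The risky step never happened, so the risk never emerged.
        return True
    t = trajectory.index(rule.trigger)  # first occurrence of the risky step
    if rule.when == "before":
        return rule.mitigation in trajectory[:t]
    return rule.mitigation in trajectory[t + 1:]


def safe_success(trajectory: List[str], goal_reached: bool,
                 rules: List[SafetyRule]) -> bool:
    """A run counts only if the goal is met and every rule holds in order."""
    return goal_reached and all(rule_satisfied(trajectory, r) for r in rules)


# Assumed example: the stove must be turned off after cooking.
rules = [SafetyRule(trigger="cook(egg)", mitigation="turn_off(stove)", when="after")]
safe_run = ["turn_on(stove)", "cook(egg)", "turn_off(stove)", "serve(egg)"]
unsafe_run = ["turn_on(stove)", "cook(egg)", "serve(egg)"]

print(safe_success(safe_run, True, rules))
print(safe_success(unsafe_run, True, rules))
```

Checking intermediate steps this way separates a run that merely reaches the goal from one that reaches it safely, which is the gap the abstract attributes to post-hoc, final-state evaluation.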

Code & Models

Datasets

Ursulalala/IS_Bench_dataset
dataset · 136 downloads

Videos

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Taxonomy

Topics: Social Robot Interaction and HRI · Human-Automation Interaction and Safety · Reinforcement Learning in Robotics