HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Jiayue Pu; Zhongxiang Sun; Zilu Zhang; Xiao Zhang; Jun Xu

arXiv:2603.11975 · cs.CV · March 16, 2026

Open Access

TL;DR

This paper introduces HomeSafe-Bench, a comprehensive benchmark for evaluating vision-language models in detecting unsafe actions in household scenarios, and proposes HD-Guard, a hierarchical system for real-time safety monitoring of household robots.

Contribution

It presents a new benchmark dataset for dynamic unsafe action detection in household environments and introduces a hierarchical architecture for real-time safety monitoring.

Findings

1. HD-Guard balances inference speed and accuracy effectively.

2. Current VLMs have significant bottlenecks in safety detection.

3. HomeSafe-Bench provides diverse, fine-grained safety scenarios.

Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Human Pose and Action Recognition