HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

Xiaoya Lu; Yijin Zhou; Zeren Chen; Ruocheng Wang; Bingrui Sima; Enshen Zhou; Lu Sheng; Dongrui Liu; Jing Shao

arXiv:2603.14367 · cs.CV · March 17, 2026

TL;DR

HomeGuard introduces a novel architecture-agnostic safeguard using Context-Guided Chain-of-Thought to improve safety in embodied agents by accurately identifying contextual risks in household tasks.

Contribution

The paper proposes a new safeguard mechanism with CG-CoT that enhances risk detection and grounding in VLM-based embodied agents, supported by a curated dataset and reinforcement fine-tuning.

Findings

01. Risk match rates improved by over 30%
02. Enhanced safety and reduced oversafety in hazard detection
03. Generated visual anchors enable explicit collision avoidance

Abstract

Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate. Rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception that sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing…
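The two-step decomposition described in the abstract (active perception first, semantic judgment second) can be illustrated with a minimal Python sketch. This is not the authors' implementation: in HomeGuard the visual anchors and the judgment come from a VLM, whereas here the detections are hand-written and the judge is a toy rule. Every name below (`Anchor`, `RiskReport`, `assess`, `toy_judge`, `FLAMMABLE`, the `margin` parameter) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Axis-aligned box as (x_min, y_min, x_max, y_max); units are arbitrary.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Anchor:
    """A labeled region in the scene (hypothetical stand-in for a visual anchor)."""
    label: str
    box: Box


@dataclass
class RiskReport:
    target: Anchor
    neighbors: List[Anchor]
    risky: bool
    rationale: str


Judge = Callable[[str, Anchor, List[Anchor]], Tuple[bool, str]]


def expand(box: Box, margin: float) -> Box:
    """Grow a box on every side by `margin` times its width and height."""
    x0, y0, x1, y1 = box
    pad_x, pad_y = (x1 - x0) * margin, (y1 - y0) * margin
    return (x0 - pad_x, y0 - pad_y, x1 + pad_x, y1 + pad_y)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assess(instruction: str, target_label: str, detections: List[Anchor],
           judge: Judge, margin: float = 0.5) -> RiskReport:
    # Perception, step 1: anchor on the interaction target.
    # Raises StopIteration if the target is not among the detections.
    target = next(d for d in detections if d.label == target_label)
    # Perception, step 2: collect objects in the target's spatial neighborhood.
    region = expand(target.box, margin)
    neighbors = [d for d in detections
                 if d is not target and overlaps(region, d.box)]
    # Judgment: decide using only the anchored evidence gathered above.
    risky, rationale = judge(instruction, target, neighbors)
    return RiskReport(target, neighbors, risky, rationale)


# Toy rule standing in for the model's semantic judgment (an assumption).
FLAMMABLE = {"paper towel", "curtain"}


def toy_judge(instruction: str, target: Anchor,
              neighbors: List[Anchor]) -> Tuple[bool, str]:
    hits = [n.label for n in neighbors if n.label in FLAMMABLE]
    if target.label == "stove" and hits:
        return True, f"{', '.join(hits)} near the stove"
    return False, "no hazard found near the target"


scene = [
    Anchor("stove", (10, 10, 20, 20)),
    Anchor("paper towel", (21, 12, 24, 16)),
    Anchor("sofa", (80, 80, 100, 100)),
]
report = assess("Turn on the stove", "stove", scene, toy_judge)
print(report.risky, [n.label for n in report.neighbors], report.rationale)
```

The point of the sketch is the ordering: the judge never sees the whole scene, only the target and the objects near it, which mirrors the abstract's claim that focused perception reduces missed risks and hallucinations.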

Code & Models

Models

Ursulalala/HomeGuard-8B
model · 120 dl · ♡ 1

Datasets

Ursulalala/HomeSafe
dataset · 29 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Social Robot Interaction and HRI