Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Ying Chen; Lihuang Fang; Rui Jiang; Mingxu Wang; Zhifeng Gu; Lei Yi; Jie Chen

arXiv:2605.08747·cs.AI·May 15, 2026

TL;DR

The paper introduces VIGIL, a new evaluation framework for embodied agents that separately measures world completion and terminal commitment, enabling clearer assessment of task success and failure modes.

Contribution

VIGIL provides a protocol to independently evaluate terminal commitment and world completion, distinguishing different failure types in embodied agent benchmarks.

Findings

01

VIGIL yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report.

02

Models with similar world-state completion (W) can differ significantly in benchmark success (B) because of terminal-commitment failures.

03

Action-feedback improves world completion but does not fully resolve commitment failures.

Abstract

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution,…
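
The relationship between W and B described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Episode` record, its field names, and the `score` function are hypothetical, and the sketch assumes each episode reduces to two booleans (whether the hidden world state satisfies the task, and whether the terminal report is correct).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical record; field names are assumptions, not VIGIL's schema.
    world_complete: bool  # hidden world state satisfies the task at episode end
    report_correct: bool  # terminal report agrees with the hidden world state


def score(episodes):
    """Return (W, B) as fractions of episodes.

    W counts episodes whose world state is complete.
    B counts episodes that are complete AND end with a correct report,
    so B can never exceed W.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    w = sum(e.world_complete for e in episodes) / n
    b = sum(e.world_complete and e.report_correct for e in episodes) / n
    return w, b


# Toy data for illustration only; these are not results from the paper.
episodes = [
    Episode(world_complete=True, report_correct=True),
    Episode(world_complete=True, report_correct=False),
    Episode(world_complete=False, report_correct=False),
    Episode(world_complete=False, report_correct=True),
]
w, b = score(episodes)
gap = w - b  # share of episodes completed in the world but lost at the report
```

On this toy data the gap between W and B isolates episodes where the task was done in the world but the terminal report was wrong, which a single success score would fold into ordinary failure.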
