Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Ying Chen, Lihuang Fang, Rui Jiang, Mingxu Wang, Zhifeng Gu, Lei Yi, Jie Chen

TL;DR
The paper introduces VIGIL, a new evaluation framework for embodied agents that separately measures world completion and terminal commitment, enabling clearer assessment of task success and failure modes.
Contribution
VIGIL provides a protocol that evaluates terminal commitment independently of world completion, separating failure types that standard embodied-agent benchmarks collapse into a single failure.
Findings
VIGIL yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report.
Models with similar world completion can differ significantly in benchmark success because of terminal commitment failures.
Action feedback improves world completion but does not fully resolve commitment failures.
Abstract
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution,…
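The relationship between the two scores can be illustrated with a minimal sketch. It assumes each episode reduces to two booleans, and it is not the authors' implementation: the `Episode` record, its field names, the `score` function, and the sample data are all hypothetical. The only rule taken from the abstract is that B requires world completion plus a correct terminal report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical per-episode record; field names are assumptions.
    world_complete: bool  # hidden world state satisfies the task goal
    report_correct: bool  # terminal report passes the deterministic check


def score(episodes):
    """Return (W, B) as fractions of episodes.

    W counts episodes whose world state is complete.
    B counts episodes that are complete AND end with a correct report.
    """
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    w = sum(e.world_complete for e in episodes) / n
    b = sum(e.world_complete and e.report_correct for e in episodes) / n
    return w, b


# Two made-up agents with identical world completion but different
# terminal commitment behavior.
agent_a = [
    Episode(True, True),
    Episode(True, True),
    Episode(True, True),
    Episode(False, False),
]
agent_b = [
    Episode(True, True),
    Episode(True, False),
    Episode(True, False),
    Episode(False, False),
]

print(score(agent_a))  # (0.75, 0.75)
print(score(agent_b))  # (0.75, 0.25)
```

In this toy data both agents complete the task in three of four episodes, so W is the same, but the second agent loses two of those episodes at the reporting step. Because B can never exceed W under this definition, the gap W minus B is one simple way to read off how much benchmark failure is attributable to commitment rather than execution.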
