Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Ziqi Ma; Mengzhan Liufu; Georgia Gkioxari

arXiv:2603.13215·cs.CV·March 16, 2026

TL;DR

This paper introduces STEVO-Bench, a benchmark to evaluate whether video world models can accurately simulate natural state evolution independently of observations, revealing their limitations and biases.

Contribution

The paper presents STEVO-Bench, a benchmark with an evaluation protocol that identifies and analyzes failure modes in how current video world models decouple state evolution from observation.

Findings

01 Video models struggle to decouple state evolution from observation.

02 Current models exhibit biases in natural state evolution.

03 STEVO-Bench reveals specific failure modes in existing models.

Abstract

Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light, or specifying camera "lookaway" trajectories. By evaluating video models with and without camera control for a diverse set of naturally-occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol to automatically detect and disentangle failure modes of video world models across key aspects of natural state…
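
As a rough illustration of the setup the abstract describes, the sketch below pairs each evolving process with each of the three observation controls (occluder insertion, turning off the light, camera "lookaway") to form text instructions. It is not the benchmark's code: the class names, the `build_cases` helper, and the instruction wording are all hypothetical, and the actual prompts and camera trajectories used by STEVO-Bench are not given on this page.

```python
from dataclasses import dataclass
from enum import Enum


class ObservationControl(Enum):
    """The three observation controls named in the abstract."""
    OCCLUDER = "occluder insertion"
    LIGHTS_OFF = "turning off the light"
    CAMERA_LOOKAWAY = "camera lookaway"


# Hypothetical instruction wording, written only for this sketch.
INSTRUCTION_TEMPLATES = {
    ObservationControl.OCCLUDER:
        "An object slides in front of the scene, hides it, then moves away.",
    ObservationControl.LIGHTS_OFF:
        "The light turns off for a while, then turns back on.",
    ObservationControl.CAMERA_LOOKAWAY:
        "The camera pans away from the scene, then pans back.",
}


@dataclass(frozen=True)
class ControlledCase:
    """One evolving process combined with one observation control."""
    process: str
    control: ObservationControl

    def prompt(self) -> str:
        # Process description first, then the observation-control instruction.
        return f"{self.process} {INSTRUCTION_TEMPLATES[self.control]}"


def build_cases(processes):
    """Cross every process with every control (assumed design, not the paper's)."""
    return [ControlledCase(p, c) for p in processes for c in ObservationControl]


cases = build_cases(["Water is poured into a glass.",
                     "An ice cube melts on a plate."])
for case in cases:
    print(case.control.value, "->", case.prompt())
```

The point of crossing processes with controls is that the same underlying evolution can be checked after each kind of interrupted observation: when the scene becomes visible again, the state should have advanced as if it had been watched the whole time.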

Code & Models

Datasets

JhanLiufu/StEvo-Bench
dataset · 981 dl

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Multimodal Machine Learning Applications