WorldModelBench: Judging Video Generation Models As World Models

Dacheng Li; Yunhao Fang; Yukang Chen; Shuo Yang; Shiyi Cao; Justin Wong; Michael Luo; Xiaolong Wang; Hongxu Yin; Joseph E. Gonzalez; Ion Stoica; Song Han; Yao Lu

arXiv:2502.20694 · cs.CV · March 3, 2025

TL;DR

WorldModelBench is a new benchmark for evaluating video generation models as world models, focusing on physics adherence and decision-making relevance, with a large-scale human-labeled dataset and an automated judger.

Contribution

We introduce WorldModelBench, a comprehensive benchmark that evaluates physics and instruction adherence in video world models, supported by a large human annotation dataset and an improved automated evaluation method.

Findings

01

WorldModelBench detects subtle physics violations in video models.

02

67K crowd-sourced human labels enable accurate evaluation.

03

Fine-tuning an automated judger improves violation detection accuracy.

Abstract

Video generation models have progressed rapidly, positioning themselves as video world models capable of supporting decision-making applications such as robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims: they focus only on general video quality and ignore factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitive to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law, issues overlooked by prior benchmarks. (2) Aligned with large-scale human…
