WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Keming Wu; Yijing Cui; Wenhan Xue; Qijie Wang; Xuan Luo; Zhiyuan Feng; Zuhao Yang; Sudong Wang; Sicong Jiang; Haowei Zhu; Zihan Wang; Ping Nie; Wenhu Chen; Bin Wang

arXiv:2605.10434 · cs.CV · May 12, 2026

TL;DR

WorldReasonBench is a benchmark for evaluating whether video generators reason correctly about world dynamics, testing whether generated videos stay physically, socially, logically, and informationally consistent.

Contribution

The paper presents a new benchmark of 436 curated test cases with structured QA annotations, together with a two-part human-aligned methodology for assessing world reasoning in video generation models.

Findings

01 Modern video generators often produce visually plausible videos that still reason incorrectly about the world.

02 The benchmark reveals persistent gaps in causality, dynamics, and information preservation in generated videos.

03 The evaluation toolkit and benchmark will be publicly released to advance research in world-aware video generation.

Abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics…
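To make the world-state prediction framing concrete, the following is a minimal sketch of how one test case with structured QA annotations might be represented and scored. Only the four dimension names come from the abstract. The class names, field names, answer format, and the exact-match scoring rule are all hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List

# The four reasoning dimensions named in the abstract.
# Everything else in this sketch is a hypothetical illustration.
DIMENSIONS = ("physical", "social", "logical", "informational")


@dataclass
class QAItem:
    question: str   # a question about the generated video
    expected: str   # ground-truth answer, e.g. "yes" or "no" (assumed format)


@dataclass
class WorldCase:
    case_id: str
    dimension: str        # one of DIMENSIONS
    subcategory: str      # hypothetical label; the paper lists 22 subcategories
    initial_state: str    # description of (or path to) the starting observation
    action: str           # the action whose consequences must be predicted
    qa: List[QAItem] = field(default_factory=list)


def qa_accuracy(case: WorldCase, answers: List[str]) -> float:
    """Fraction of QA items whose judged answer matches the ground truth.

    Missing answers count as wrong because the denominator is len(case.qa).
    """
    if not case.qa:
        return 0.0
    correct = sum(
        1
        for item, answer in zip(case.qa, answers)
        if answer.strip().lower() == item.expected.strip().lower()
    )
    return correct / len(case.qa)


def per_dimension_accuracy(
    cases: List[WorldCase], answers_by_case: Dict[str, List[str]]
) -> Dict[str, float]:
    """Average the per-case QA accuracy within each reasoning dimension."""
    scores = defaultdict(list)
    for case in cases:
        answers = answers_by_case.get(case.case_id, [])
        scores[case.dimension].append(qa_accuracy(case, answers))
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}
```

In this sketch the answers would come from a human or automated judge that watches the generated video. How the paper's Process-aware Reasoning Verification actually produces and aggregates such judgments is not specified in the text above, so the exact-match rule here is only a stand-in.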

Code & Models

Repositories

UniX-AI-Lab/WorldReasonBench (GitHub)
