Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Xinxin Liu, Zhaopan Xu, Ming Li, Kai Wang, Yong Jae Lee, Yuzhang Shang

TL;DR
Gen-ViRe is a new benchmark that evaluates how well video generation models reason when simulating real-world dynamics. It uses multi-step, physics-grounded reasoning tasks, which existing benchmarks focused on fidelity or alignment do not assess.
Contribution
This paper introduces Gen-ViRe, the first comprehensive framework to quantitatively assess visual reasoning in video models across multiple cognitive dimensions and subtasks.
Findings
State-of-the-art models show high visual fidelity but limited reasoning depth.
Gen-ViRe provides diagnostic insights into model reasoning abilities.
Baseline results highlight significant room for improvement in visual reasoning.
Abstract
While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a…
Taxonomy
Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
