Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Xinxin Liu; Zhaopan Xu; Ming Li; Kai Wang; Yong Jae Lee; Yuzhang Shang

arXiv:2511.13853 · cs.CV · February 13, 2026

Open Access

TL;DR

Gen-ViRe is a benchmark that evaluates whether video generation models can reason while simulating real-world dynamics. It uses multi-step, physics-grounded tasks to fill a gap left by existing benchmarks, which focus on fidelity or alignment.

Contribution

This paper introduces Gen-ViRe, the first comprehensive framework to quantitatively assess visual reasoning in video models across multiple cognitive dimensions and subtasks.

Findings

1. State-of-the-art models show high visual fidelity but limited reasoning depth.
2. Gen-ViRe provides diagnostic insights into model reasoning abilities.
3. Baseline results highlight significant room for improvement in visual reasoning.

Abstract

While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a…

Taxonomy

Topics: Artificial Intelligence in Games · Multimodal Machine Learning Applications · Social Robot Interaction and HRI