Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks

Cheng Yang; Haiyuan Wan; Yiran Peng; Xin Cheng; Zhaoyang Yu; Jiayi Zhang; Junchi Yu; Xinlei Yu; Xiawu Zheng; Dongzhan Zhou; Chenglin Wu

arXiv:2511.15065 · cs.CV · November 25, 2025


TL;DR

This paper introduces VR-Bench, a new benchmark for evaluating video models' reasoning abilities through maze-solving tasks, demonstrating their spatial reasoning strengths and the benefits of diverse sampling during inference.

Contribution

The paper presents VR-Bench, the first comprehensive benchmark for reasoning via video, and empirically evaluates video models' spatial reasoning capabilities using maze-solving tasks.

Findings

01

Video models outperform vision-language models (VLMs) on spatial reasoning tasks.

02

Diverse sampling during inference improves reasoning reliability by 10-20%.

03

Video models generalize well across different maze types and complexities.

Abstract

Video Models have achieved remarkable success in high-fidelity video generation with coherent motion dynamics. Analogous to the development from text generation to text-based reasoning in language modeling, the development of video models motivates us to ask: Can video models reason via video generation? Compared with the discrete text corpus, video grounds reasoning in explicit spatial layouts and temporal continuity, which serves as an ideal substrate for spatial reasoning. In this work, we explore the reasoning via video paradigm and introduce VR-Bench -- a comprehensive benchmark designed to systematically evaluate video models' reasoning capabilities. Grounded in maze-solving tasks that inherently require spatial planning and multi-step reasoning, VR-Bench contains 7,920 procedurally generated videos across five maze types and diverse visual styles. Our empirical analysis…
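To make the maze-solving setup concrete, the sketch below shows one way a ground-truth route could be computed for a grid maze and how a candidate trajectory (for example, one extracted from a generated video) might be checked against the maze. This is an illustration only, not VR-Bench's actual generator or metric: the grid encoding (`#`, `.`, `S`, `G`) and the names `MAZE`, `find`, `shortest_path`, and `is_valid_path` are all assumptions introduced here.

```python
from collections import deque

# Hypothetical grid encoding (an assumption, not VR-Bench's format):
# '#' = wall, '.' = free cell, 'S' = start, 'G' = goal.
MAZE = [
    "S.#",
    ".##",
    "..G",
]

def find(maze, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")

def is_free(maze, cell):
    """True if cell lies inside the grid and is not a wall."""
    r, c = cell
    return 0 <= r < len(maze) and 0 <= c < len(maze[r]) and maze[r][c] != "#"

def shortest_path(maze):
    """Breadth-first search from 'S' to 'G'; returns a list of cells or None."""
    start, goal = find(maze, "S"), find(maze, "G")
    prev = {start: None}          # back-pointers, also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if is_free(maze, nxt) and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None

def is_valid_path(maze, path):
    """Check that a trajectory starts at 'S', ends at 'G', avoids walls,
    and moves one grid step at a time."""
    if not path or path[0] != find(maze, "S") or path[-1] != find(maze, "G"):
        return False
    if not all(is_free(maze, cell) for cell in path):
        return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))

reference = shortest_path(MAZE)
print(reference)
print(is_valid_path(MAZE, reference))
```

A trajectory that cuts through a wall, such as `[(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]`, fails `is_valid_path` even though it starts and ends at the right cells, which is the kind of rule violation a path-level check is meant to catch.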

Code & Models

Models

🤗 HY-Wan/Wan-R1
model · ♡ 2

Datasets

amagipeng/VR-Bench
dataset · 44 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Artificial Intelligence in Games