Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices

Hokin Deng

arXiv:2512.05969·cs.CV·December 9, 2025

Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven' Matrices

Hokin Deng

PDF

Open Access

TL;DR

Video generation models are now capable of reasoning on complex tasks like chess, Sudoku, and Raven's Matrices, achieving significant success rates through a scalable evaluation paradigm.

Contribution

The paper introduces a new experimental paradigm and code framework for evaluating reasoning in video models, enabling scalable and automated assessment aligned with human judgment.

Findings

01

Models like Sora-2 reach 60% success on reasoning tasks.

02

A robust, scalable evaluation paradigm correlates well with human judgment.

03

Open-source code and results facilitate further research and reinforcement learning.

Abstract

We show that video generation models could reason now. Testing on tasks such as chess, maze, Sudoku, mental rotation, and Raven's Matrices, leading models such as Sora-2 achieve sixty percent success rates. We establish a robust experimental paradigm centered on the "Task Pair" design. We build a code framework, with 39 models available already, that supports this paradigm and allows for easy scaling - users can add models and tasks efficiently. We show our automated evaluation strongly correlates with human judgment, and therefore this paradigm is highly scalable. We see an opportunity, given the availability of our paradigm, to do reinforcement learning for improving reasoning in video models. You could checkout all of our raw $\href$ and our $\href$ codebase.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsArtificial Intelligence in Games · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications