Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

Chengzu Li; Zanyi Wang; Jiaang Li; Yi Xu; Han Zhou; Huanyu Zhang; Ruichuan An; Dengyang Jiang; Zhaochong An; Ivan Vulić; Serge Belongie; Anna Korhonen

arXiv:2601.21037·cs.LG·January 30, 2026

Open Access

TL;DR

This paper formulates visual reasoning as video generation, treating generated frames as intermediate reasoning steps between an initial state and its solution, and reports that test-time scaling improves zero-shot generalization on complex spatial tasks.

Contribution

The paper proposes using video generation models for visual reasoning and highlights the roles of visual context and test-time scaling in improving zero-shot performance.

Findings

01. Strong zero-shot generalization across tasks

02. Effective use of visual context for control and adaptation

03. Test-time scaling improves reasoning on complex paths

Abstract

Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control,…
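To make the "frames as intermediate reasoning steps" idea concrete for the Maze Navigation regime, here is a minimal sketch, not the authors' method. It assumes a generated video has already been reduced to one agent grid cell per frame, and checks whether that sequence is a valid start-to-goal path. The grid encoding (`#` for walls), the function name `path_solves_maze`, and the per-frame position list are all hypothetical and used only for illustration.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col); hypothetical encoding of the agent position in one frame


def path_solves_maze(grid: List[str], frames: List[Cell], start: Cell, goal: Cell) -> bool:
    """Return True if the per-frame positions form a valid start-to-goal path.

    Assumption: '#' marks a wall and any other character is free space.
    """
    # The first frame must show the initial state, the last frame the solution.
    if not frames or frames[0] != start or frames[-1] != goal:
        return False

    # Every intermediate frame must place the agent on a free cell inside the grid.
    for r, c in frames:
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
        if not inside or grid[r][c] == "#":
            return False

    # Consecutive frames may differ by at most one orthogonal step;
    # staying in place is allowed, since adjacent frames can be near-identical.
    for (r0, c0), (r1, c1) in zip(frames, frames[1:]):
        if abs(r0 - r1) + abs(c0 - c1) > 1:
            return False

    return True


MAZE = [
    ".#.",
    "...",
    "#..",
]

valid_frames = [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2)]
through_wall = [(0, 0), (0, 1), (0, 2)]
teleporting = [(0, 0), (1, 1)]
```

A checker of this kind treats each frame as one verifiable reasoning step: a path is accepted only if every transition between consecutive frames is a legal move, which is the sense in which the frames sit "between initial states and solutions".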

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Robot Manipulation and Learning