Demystifying Video Reasoning
Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

TL;DR
This paper shows that reasoning in diffusion-based video models occurs primarily along the denoising steps rather than frame by frame, and that it involves emergent behaviors such as working memory and self-correction.
Contribution
It uncovers the Chain-of-Steps mechanism, identifies emergent reasoning behaviors, and proposes a simple ensembling strategy to enhance reasoning in diffusion video models.
Findings
Reasoning emerges along denoising steps, not frames.
Models explore multiple solutions early and converge later.
Layers specialize in perception, reasoning, and consolidation.
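One way to make the "explore early, converge later" finding concrete is to track how spread out a model's belief over candidate answers is at each denoising step. The sketch below is a toy illustration only: the per-step probabilities in `step_probs` are made-up numbers, not measurements from the paper, and using Shannon entropy as the convergence measure is an assumption, not necessarily the probe the authors use.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical probabilities over three candidate answers at four
# successive denoising steps (illustrative values, not real data).
step_probs = [
    [0.34, 0.33, 0.33],  # early step: candidates are nearly tied
    [0.50, 0.30, 0.20],
    [0.80, 0.15, 0.05],
    [0.98, 0.01, 0.01],  # late step: one candidate dominates
]

# Entropy per step: high while exploring, low once converged.
entropies = [entropy(p) for p in step_probs]

for step, h in enumerate(entropies):
    print(f"step {step}: entropy = {h:.3f} bits")
```

Under these assumed numbers the entropy falls monotonically from close to log2(3) bits toward zero, which is the pattern that exploring several candidates and then converging on one would produce.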
Abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Embodied and Extended Cognition · Ferroelectric and Negative Capacitance Devices
