A Mechanistic View on Video Generation as World Models: State and Dynamics

Luozhou Wang; Zhifei Chen; Yihua Du; Dongyu Yan; Wenhang Ge; Guibao Shen; Xinli Xu; Leyi Wu; Man Chen; Tianshuo Xu; Peiran Ren; Xin Tao; Pengfei Wan; Ying-Cong Chen

arXiv:2601.17067·cs.CV·January 27, 2026

TL;DR

This paper proposes a new framework for video generation models that emphasizes state construction and dynamics modeling, aiming to develop more robust world models capable of physical reasoning and causal understanding.

Contribution

It introduces a novel taxonomy for classifying video models based on state and dynamics, bridging the gap between current architectures and classical world model theories.

Findings

01

Categorizes video models into implicit and explicit state construction paradigms.

02

Analyzes dynamics modeling through knowledge integration and architectural reformulation.

03

Recommends shifting evaluation from visual quality to functional benchmarks that test physical persistence and causal reasoning.

Abstract

Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality…
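To make the implicit/explicit distinction in the abstract concrete, here is a minimal toy sketch. It is not the paper's method: the class names `ImplicitState` and `ExplicitState`, the list-of-floats "frame", and the exponential-moving-average update are all hypothetical stand-ins chosen for illustration. The implicit variant treats a bounded window of past frames as the state (context management); the explicit variant folds each frame into a fixed-size latent vector (latent compression).

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = List[float]  # toy stand-in for a video frame (assumption, not from the paper)


class ImplicitState:
    """Context management: the 'state' is a bounded window of raw past frames."""

    def __init__(self, window: int) -> None:
        self.context: Deque[Frame] = deque(maxlen=window)

    def update(self, frame: Frame) -> None:
        # Once the window is full, the oldest frame is evicted automatically.
        self.context.append(frame)


class ExplicitState:
    """Latent compression: the 'state' is a fixed-size vector updated recurrently."""

    def __init__(self, dim: int, decay: float = 0.5) -> None:
        self.latent: List[float] = [0.0] * dim
        self.decay = decay

    def update(self, frame: Frame) -> None:
        # Exponential moving average as a toy compression rule (assumption).
        self.latent = [
            self.decay * s + (1.0 - self.decay) * x
            for s, x in zip(self.latent, frame)
        ]


def run_demo() -> Tuple[List[Frame], List[float]]:
    """Feed five toy frames through both state constructions."""
    frames = [[float(t), 2.0 * float(t)] for t in range(5)]
    implicit = ImplicitState(window=3)
    explicit = ExplicitState(dim=2)
    for frame in frames:
        implicit.update(frame)
        explicit.update(frame)
    return list(implicit.context), explicit.latent


if __name__ == "__main__":
    context, latent = run_demo()
    print("implicit context:", context)
    print("explicit latent:", latent)
```

In this sketch the implicit state keeps recent frames exactly but forgets everything outside the window, while the explicit state has constant memory cost but retains only a lossy summary. That trade-off is one way to read the "persistence" and "compressed fidelity" frontiers named in the abstract.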

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition