ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation

Haonan Wang; Hanyu Zhou; Tao Gu; Luxin Yan

arXiv:2605.07390 · cs.CV · May 11, 2026

TL;DR

ST-Gen4D introduces a novel 4D spatiotemporal cognition-based world model that enhances 4D generation by capturing local dynamics and global appearance, outperforming existing methods.

Contribution

The paper presents a new framework integrating 4D cognition with generative priors, enabling structurally rational 4D generation with topological consistency.

Findings

01. Outperforms existing 4D generation methods in experiments.
02. Guarantees structural rationality and topological consistency.
03. Introduces ST-4D datasets for benchmarking.

Abstract

Generative models have succeeded in producing apparently coherent 2D videos, but they still struggle in the physical world because they lack 4D spatiotemporal scale. Existing 4D generative models typically embed macro-scale constraints directly to enhance overall spatiotemporal consistency. However, these methods ensure only global appearance coherence and fail to reveal the local dynamics of the physical world. Our insight is that global appearance structure and local dynamic topology together empower 4D spatiotemporal cognition, thereby enabling 4D generation with spatiotemporal regularities. In this work, we propose ST-Gen4D, a 4D generation framework with a 4D spatiotemporal cognition-based world model. Our model is guided by four key designs: 1) Spatiotemporal representation. We encode various modalities into multiple representations as a feature basis. 2) Spatiotemporal cognition. We…
