DeepVerse: 4D Autoregressive Video Generation as a World Model
Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, Tong He

TL;DR
DeepVerse introduces a 4D world model that explicitly incorporates geometric predictions to improve long-term, consistent, and realistic video generation, addressing limitations of previous visual-only models.
Contribution
It is the first to explicitly integrate geometric constraints into a 4D autoregressive video generation model for enhanced physical and spatial consistency.
Findings
Reduces drift and improves temporal consistency in generated videos.
Enhances prediction accuracy and visual realism.
Preserves long-term spatial coherence through geometry-aware memory retrieval.
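As a rough illustration of what geometry-aware memory retrieval could look like, the sketch below ranks stored frames by how close their camera position and viewing direction are to the current camera. The function name `retrieve_memory`, the weight `w_angle`, and the scoring formula are assumptions made for illustration; they are not taken from DeepVerse's implementation.

```python
import numpy as np

def retrieve_memory(query_pos, query_dir, mem_pos, mem_dir, k=2, w_angle=1.0):
    """Return indices of the k stored frames whose camera pose is closest to the query.

    Hypothetical score: Euclidean distance between camera centers plus
    w_angle times the angle (in radians) between viewing directions.
    """
    mem_pos = np.asarray(mem_pos, dtype=float)
    mem_dir = np.asarray(mem_dir, dtype=float)
    q_pos = np.asarray(query_pos, dtype=float)
    q_dir = np.asarray(query_dir, dtype=float)

    # Normalize viewing directions so the dot product is a cosine.
    q_dir = q_dir / np.linalg.norm(q_dir)
    m_dir = mem_dir / np.linalg.norm(mem_dir, axis=1, keepdims=True)

    dist = np.linalg.norm(mem_pos - q_pos, axis=1)
    angle = np.arccos(np.clip(m_dir @ q_dir, -1.0, 1.0))

    score = dist + w_angle * angle  # lower is better
    return np.argsort(score)[:k]

# Three stored frames: same spot, far away, and nearby but facing backwards.
mem_pos = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
mem_dir = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
nearest = retrieve_memory([0.0, 0.0, 0.0], [0.0, 0.0, 1.0], mem_pos, mem_dir)
```

With this scoring, a frame taken nearby but facing the opposite direction is penalized by the angle term, so orientation matters as well as position.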
Abstract
World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency,…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper integrates explicit 4D modeling into an autoregressive world model and couples the "spatial memory" mechanism with parallel spatial distribution modeling. The final results are promising: explicit 4D modeling improves temporal and spatial consistency. The paper has a clear structure (apart from the related work).
Confusing and unsupported arguments: Line 068 of the main paper states, "However, these visual-centric strategies fundamentally overlook a critical aspect: videos inherently represent 2D projections of a dynamic 3D/4D physical world. Without explicit modeling of underlying geometric structures, models inevitably struggle to maintain long-term accuracy and consistency in visual predictions." This is the core motivation of the paper; however, it does not cite any references and lacks experimental
1. This paper addresses a novel problem: 4D generation and long-term memory, a current hot topic. The methods presented in this work are comprehensive: 4D representations (depth + raymap), token-level historical fusion, and geometry-based memory retrieval.
2. The paper provides detailed ablation studies on the effect of geometric inputs, spatial conditioning, and historical fusion strategies, supported by both quantitative metrics and visual comparisons.
3. The paper is very well written and
1. The core idea of incorporating geometry and memory into world models has been explored in several concurrent works (e.g., Aether[1], TesserAct[2], WorldMem[3]). The novelty mainly lies in combining these components rather than introducing a fundamentally new mechanism.
2. The paper lacks direct quantitative comparisons with strong recent baselines such as Aether[1], DFoT[4], or FramePack[5].
3. The geometry-based retrieval module relies on handcrafted distance and orientation metrics, whi
1. The core idea of having a world model generate explicit 3D geometry (depth and camera pose) in addition to pixels is a promising direction for improving long-term consistency in video generation.
2. Using the generated 3D camera pose as a key to retrieve relevant past states from a memory bank is an intuitive and logical approach to help the model "remember" previously visited locations and maintain spatial coherence.
3. The work serves as a reasonable proof-of-concept, demonstrating the fe
1. Clarity and Writing: The quality of the writing is a significant barrier. The text is often incoherent and difficult to parse, which required substantial and undue effort from the reviewer to understand the proposed methodology and experimental setup.
2. Lack of content: The technical contribution feels limited. A large portion of the paper (e.g., Section 3.1) is dedicated to exploring architectures for conditioning on historical information (channel-wise vs. token-wise). This topic ha
- A large-scale model for 4D sequence generation is impressive.
- Visual generation results and video consistency improve when depth is added.
- The retrieval mechanism based on spatial closeness is simple and effective.
- Missing baselines. There are no direct comparisons to other works doing 4D modelling. It is easy to show that the model is better than any model not using depth; however, it is hard to evaluate the authors' contribution when there are no comparisons to other models with the same inputs.
- The comparison of the two model architectures (one with channel-wise concatenation, the other with token-wise concatenation) is a little confusing. If those two models have different initialization, comparing their fin
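To make the channel-wise vs. token-wise distinction raised in the reviews concrete, here is a minimal shape-level sketch. The function names and the toy tensor sizes are assumptions for illustration only; the paper's actual conditioning layers are not described on this page.

```python
import numpy as np

def channel_wise_fuse(current, history):
    """Stack history features onto each current token along the channel axis.

    Inputs are (tokens, channels); the sequence length is unchanged and the
    channel width grows, so the next layer's input projection must be wider.
    """
    return np.concatenate([current, history], axis=-1)

def token_wise_fuse(current, history):
    """Prepend history tokens to the current tokens along the sequence axis.

    The channel width is unchanged and the sequence gets longer, so attention
    runs over more tokens.
    """
    return np.concatenate([history, current], axis=0)

# Toy latents: 16 tokens with 8 channels each (hypothetical sizes).
current = np.zeros((16, 8))
history = np.ones((16, 8))

channel_fused = channel_wise_fuse(current, history)  # shape (16, 16)
token_fused = token_wise_fuse(current, history)      # shape (32, 8)
```

The two variants change different dimensions of the input, which is one reason they may not share the same pretrained initialization without modification.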
Taxonomy
Topics: Cinema and Media Studies
