DeepVerse: 4D Autoregressive Video Generation as a World Model

Junyi Chen; Haoyi Zhu; Xianglong He; Yifan Wang; Jianjun Zhou; Wenzheng Chang; Yang Zhou; Zizun Li; Zhoujie Fu; Jiangmiao Pang; Tong He

arXiv:2506.01103·cs.CV·June 3, 2025

TL;DR

DeepVerse introduces a 4D world model that explicitly incorporates geometric predictions to improve long-term, consistent, and realistic video generation, addressing limitations of previous visual-only models.

Contribution

It is the first to explicitly integrate geometric constraints into a 4D autoregressive video generation model for enhanced physical and spatial consistency.

Findings

1. Reduces drift and improves temporal consistency in generated videos.
2. Enhances prediction accuracy and visual realism.
3. Preserves long-term spatial coherence through geometry-aware memory retrieval.

Abstract

World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency,…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

This paper integrates explicit 4D modeling into an autoregressive world model and couples the "spatial memory" mechanism with parallel spatial distribution modeling. The final results are promising: explicit 4D modeling improves temporal and spatial consistency. The paper is clearly structured (apart from the related work section).

Weaknesses

Confusing and unsupported arguments: Line 068 of the main paper states, "However, these visual-centric strategies fundamentally overlook a critical aspect: videos inherently represent 2D projections of a dynamic 3D/4D physical world. Without explicit modeling of underlying geometric structures, models inevitably struggle to maintain long-term accuracy and consistency in visual predictions." This is the core motivation of the paper; however, it does not cite any references and lacks experimental

Reviewer 02·Rating 4·Confidence 3

Strengths

1. This paper addresses a novel problem, 4D generation with long-term memory, which is a current hot topic. The methods presented in this work (4D representations (depth + raymap), token-level historical fusion, geometry-based memory retrieval) are comprehensive.
2. The paper provides detailed ablation studies on the effect of geometric inputs, spatial conditioning, and historical fusion strategies, supported by both quantitative metrics and visual comparisons.
3. The paper is very well written and
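For readers unfamiliar with the "depth + raymap" representation named in this review, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera, a raymap that stores each pixel's ray origin and unit direction in world coordinates, and depth measured as distance along the ray. The function names `raymap` and `unproject` and all parameters are hypothetical.

```python
import math

def raymap(height, width, fx, fy, cx, cy, rotation, origin):
    """Build an H x W grid of 6-tuples (origin_xyz + unit direction_xyz).

    rotation: 3x3 camera-to-world matrix as nested lists (assumed convention).
    origin:   camera centre in world coordinates.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            # Ray through the pixel centre in the camera frame (pinhole model).
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate the ray into the world frame.
            d_world = tuple(
                sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)
            )
            norm = math.sqrt(sum(c * c for c in d_world))
            direction = tuple(c / norm for c in d_world)
            row.append(tuple(origin) + direction)
        rays.append(row)
    return rays

def unproject(ray, depth):
    """Lift one pixel to a 3D point: origin + depth * direction.

    Assumes depth is the distance along the ray, not the z-coordinate.
    """
    ox, oy, oz, dx, dy, dz = ray
    return (ox + depth * dx, oy + depth * dy, oz + depth * dz)
```

Under these assumptions, a depth map paired with the raymap determines one 3D point per pixel, which is what makes the pair usable as a geometric state alongside the RGB frame.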

Weaknesses

1. The core idea of incorporating geometry and memory into world models has been explored in several concurrent works (e.g., Aether[1], TesserAct[2], WorldMem[3]). The novelty mainly lies in combining these components rather than introducing a fundamentally new mechanism.
2. The paper lacks direct quantitative comparisons with strong recent baselines such as Aether[1], DFoT[4], or FramePack[5].
3. The geometry-based retrieval module relies on handcrafted distance and orientation metrics, whi

Reviewer 03·Rating 2·Confidence 5

Strengths

1. The core idea of having a world model generate explicit 3D geometry (depth and camera pose) in addition to pixels is a promising direction for improving long-term consistency in video generation.
2. Using the generated 3D camera pose as a key to retrieve relevant past states from a memory bank is an intuitive and logical approach to help the model "remember" previously visited locations and maintain spatial coherence.
3. The work serves as a reasonable proof-of-concept, demonstrating the fe
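The pose-keyed retrieval described in this review can be illustrated with a small sketch. This is not the paper's module: it assumes each memory entry stores a camera position and a viewing direction, and it ranks entries by Euclidean distance plus a weighted angular difference. The class `MemoryEntry`, the function `retrieve`, and the `angle_weight` parameter are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class MemoryEntry:
    frame_id: int
    position: Vec3  # camera centre in world coordinates
    forward: Vec3   # viewing direction (need not be normalised)

def angle_between(u: Vec3, v: Vec3) -> float:
    """Angle in radians between two 3D vectors (pi if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return math.pi
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def retrieve(memory: List[MemoryEntry], position: Vec3, forward: Vec3,
             k: int = 1, angle_weight: float = 1.0) -> List[MemoryEntry]:
    """Return the k entries whose pose is closest to the query pose.

    Score = positional distance + angle_weight * angular difference;
    lower is better. The weighting is an assumed, hand-set hyperparameter.
    """
    def score(entry: MemoryEntry) -> float:
        dist = math.dist(entry.position, position)
        ang = angle_between(entry.forward, forward)
        return dist + angle_weight * ang
    return sorted(memory, key=score)[:k]
```

A hand-set score of this kind is simple to reason about, but the relative weighting of distance and orientation has to be tuned per scene scale, which is one reason a reviewer might question handcrafted metrics.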

Weaknesses

1. Clarity and Writing: The paper's quality of writing is a significant barrier. The text is often not coherent and difficult to parse, which required substantial and undue effort from the reviewer to understand the proposed methodology and experimental setup.
2. Lack of content: The technical contribution feels limited. A large portion of the paper (e.g., Section 3.1) is dedicated to exploring architectures for conditioning on historical information (channel-wise vs. token-wise). This topic ha

Reviewer 04·Rating 6·Confidence 4

Strengths

- Large-scale model for 4D sequence generation is impressive
- Visual generation results and video consistency improve by adding depth
- The retrieval mechanism based on spatial closeness is simple and effective

Weaknesses

- Missing baselines. There are no direct comparisons to other works doing 4D modelling. It is easy to show that the model is better than any model not using depth; however, it is hard to evaluate the authors' contribution when there are no comparisons to other models with the same inputs.
- The comparison of the two model architectures (one with channel-wise concatenation, the other with token-wise concatenation) is a little confusing. If those two models have different initialization, comparing their fin
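To make the channel-wise versus token-wise distinction concrete, here is a rough shape calculation. It is a sketch under assumptions not taken from the paper: latents of shape (channels, height, width), square patches of side `patch`, and history frames that share the current frame's resolution. The function names are hypothetical.

```python
def channel_wise(n_history, channels, height, width, patch):
    """History stacked along the channel axis before patchifying.

    The token count matches a single frame; each token's input grows wider.
    Returns (num_tokens, token_input_dim).
    """
    tokens = (height // patch) * (width // patch)
    token_dim = (n_history + 1) * channels * patch * patch
    return tokens, token_dim

def token_wise(n_history, channels, height, width, patch):
    """History appended as extra tokens along the sequence axis.

    Each token keeps the single-frame width; the sequence grows longer.
    Returns (num_tokens, token_input_dim).
    """
    tokens = (n_history + 1) * (height // patch) * (width // patch)
    token_dim = channels * patch * patch
    return tokens, token_dim

def attention_pairs(tokens):
    """Number of query-key pairs in full self-attention over `tokens`."""
    return tokens * tokens
```

For example, with 3 history frames, 4 latent channels, an 8 x 8 latent, and patch size 2, channel-wise conditioning yields 16 tokens of width 64, while token-wise conditioning yields 64 tokens of width 16. The two designs therefore differ in attention cost as well as in how history is mixed, which is worth keeping in mind when comparing them.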

Code & Models

Models

🤗
SOTAMak1r/DeepVerse1.1
model· ♡ 5

Taxonomy

Topics·Cinema and Media Studies