MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Jingcheng Ni; Yuxin Guo; Yichen Liu; Rui Chen; Lewei Lu; Zehuan Wu

arXiv:2502.11663·cs.CV·February 18, 2025

TL;DR

MaskGWM introduces a novel generalizable driving world model that leverages video mask reconstruction with a scalable diffusion transformer, enhancing long-term prediction and multi-view generation for autonomous driving.

Contribution

The paper proposes MaskGWM, a new driving world model combining diffusion transformers with mask reconstruction, extending MAE to spatial-temporal domains for better generalization.

Findings

01. Outperforms state-of-the-art methods on multiple driving datasets

02. Improves long-horizon prediction accuracy

03. Enhances multi-view generation capabilities

Abstract

World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we address this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that handle the fuzzy relations between mask reconstruction and the generative diffusion process; (3) an extension of the mask construction task to the spatial-temporal domain by…
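The general idea of pairing a generation loss with MAE-style reconstruction on masked tokens can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `random_token_mask` and `combined_loss`, the uniform random masking over a (frame, row, column) token grid, the mean-squared-error terms, and the `weight` parameter. The paper's actual masking scheme and loss weighting are not given in the abstract above.

```python
import random


def random_token_mask(num_frames, height, width, mask_ratio, seed=0):
    """Pick a set of (t, h, w) token positions to mask, uniformly at random.

    Hypothetical helper: uniform sampling is an assumption, not the paper's scheme.
    """
    rng = random.Random(seed)
    positions = [
        (t, h, w)
        for t in range(num_frames)
        for h in range(height)
        for w in range(width)
    ]
    num_masked = int(len(positions) * mask_ratio)
    return set(rng.sample(positions, num_masked))


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(gen_pred, gen_target, feat_pred, feat_target, masked_idx, weight=0.5):
    """Generation loss over all tokens plus a weighted reconstruction loss
    computed only on the masked token indices (hypothetical weighting)."""
    gen_loss = mse(gen_pred, gen_target)
    if not masked_idx:
        return gen_loss
    rec_pred = [feat_pred[i] for i in masked_idx]
    rec_target = [feat_target[i] for i in masked_idx]
    return gen_loss + weight * mse(rec_pred, rec_target)
```

In this sketch the reconstruction term only penalizes errors at masked positions, which mirrors the usual MAE convention; whether MaskGWM follows that convention is not stated in the abstract.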

Code & Models

Repositories

sensetime-fvg/opendwm
PyTorch · Official

Taxonomy

Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Layer · Dense Connections · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer · Masked autoencoder