MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu

TL;DR
MaskGWM introduces a novel generalizable driving world model that leverages video mask reconstruction with a scalable diffusion transformer, enhancing long-term prediction and multi-view generation for autonomous driving.
Contribution
The paper proposes MaskGWM, a new driving world model combining diffusion transformers with mask reconstruction, extending MAE to spatial-temporal domains for better generalization.
Findings
Outperforms state-of-the-art on multiple driving datasets
Improves long-horizon prediction accuracy
Enhances multi-view generation capabilities
Abstract
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing driving world model mainly build on video prediction model. Although these models can produce high-fidelity video sequences with advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore to solve this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key design: (1) A more scalable Diffusion Transformer (DiT) structure trained with extra mask construction task. (2) we devise diffusion-related mask tokens to deal with the fuzzy relations between mask reconstruction and generative diffusion process. (3) we extend mask construction task to spatial-temporal domain by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
Topics3D Shape Modeling and Analysis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
MethodsAttention Is All You Need · Byte Pair Encoding · Layer Normalization · Residual Connection · Linear Layer · Dense Connections · Multi-Head Attention · Diffusion · Position-Wise Feed-Forward Layer · Masked autoencoder
