DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
Xiaotao Hu, Wei Yin, Mingkai Jia, Junyuan Deng, Xiaoyang Guo, Qian Zhang, Xiaoxiao Long, Ping Tan

TL;DR
DrivingWorld introduces a novel GPT-style model with spatial-temporal fusion mechanisms for autonomous driving, enabling high-fidelity, long-duration video generation and improved control over future scene prediction.
Contribution
The paper presents a new spatial-temporal fusion GPT-style model with strategies for better generalization and control, significantly surpassing prior models in video quality and duration.
Findings
Generates high-quality driving videos over 40 seconds long
Achieves over twice the duration of previous state-of-the-art models
Outperforms prior methods in visual fidelity and controllability
Abstract
Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video…
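To make the abstract's point about 1D context concrete, the sketch below contrasts two attention masks over a video flattened into a token sequence. This is an illustrative assumption, not DrivingWorld's actual fusion mechanism, which the abstract does not detail. The function names and the toy sizes (2 frames, 3 tokens per frame) are hypothetical.

```python
def token_causal_mask(num_frames, tokens_per_frame):
    """Classic 1D GPT mask: query token q may attend to key tokens 0..q
    of the flattened video, ignoring frame boundaries."""
    n = num_frames * tokens_per_frame
    return [[k <= q for k in range(n)] for q in range(n)]


def frame_causal_mask(num_frames, tokens_per_frame):
    """Frame-level causal mask (an assumed alternative): a token may attend
    to every token in its own frame and in all earlier frames."""
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]


# Toy video: 2 frames, 3 tokens per frame, flattened to 6 tokens.
flat = token_causal_mask(2, 3)
block = frame_causal_mask(2, 3)

# Under the 1D mask, token 0 cannot see token 2 of its own frame;
# under the frame-level mask it can.
print(flat[0][2], block[0][2])
```

With the 1D mask, tokens in the same frame see each other only in raster order, so spatial context within a frame is partial. The frame-level mask keeps causality across time while letting each frame be modeled as a whole.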
Taxonomy
Topics: Big Data Technologies and Applications
Methods: Cosine Annealing · Linear Layer · Residual Connection · Weight Decay · Multi-Head Attention · Adam · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning
