DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT

Xiaotao Hu; Wei Yin; Mingkai Jia; Junyuan Deng; Xiaoyang Guo; Qian Zhang; Xiaoxiao Long; Ping Tan

arXiv:2412.19505·cs.CV·December 31, 2024

Open Access · 1 Repo · 1 Model

TL;DR

DrivingWorld introduces a novel GPT-style model with spatial-temporal fusion mechanisms for autonomous driving, enabling high-fidelity, long-duration video generation and improved control over future scene prediction.

Contribution

The paper presents a new spatial-temporal fusion GPT-style model with strategies for better generalization and control, significantly surpassing prior models in video quality and duration.

Findings

01 Generates high-quality driving videos over 40 seconds long

02 Generates videos more than twice as long as those of previous state-of-the-art models

03 Outperforms prior methods in visual fidelity and controllability

Abstract

Recent successes in autoregressive (AR) generation models, such as the GPT series in natural language processing, have motivated efforts to replicate this success in visual tasks. Some works attempt to extend this approach to autonomous driving by building video-based world models capable of generating realistic future video sequences and predicting ego states. However, prior works tend to produce unsatisfactory results, as the classic GPT framework is designed to handle 1D contextual information, such as text, and lacks the inherent ability to model the spatial and temporal dynamics essential for video generation. In this paper, we present DrivingWorld, a GPT-style world model for autonomous driving, featuring several spatial-temporal fusion mechanisms. This design enables effective modeling of both spatial and temporal dynamics, facilitating high-fidelity, long-duration video…
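The abstract's point about 1D context can be made concrete with a small sketch. It assumes video tokens are flattened in plain raster-scan order (frame by frame, row by row) and uses a hypothetical 16 x 32 token grid per frame; neither the ordering nor the grid size is taken from the paper, and the function names are illustrative only. The sketch shows how far apart neighbouring tokens end up once a video is treated as a single 1D sequence.

```python
def flat_index(t, h, w, height, width):
    """Position of token (t, h, w) in a raster-scan 1D sequence.

    Assumption: frames are concatenated one after another and each
    frame is scanned row by row. This is an illustrative ordering,
    not necessarily the one used by DrivingWorld.
    """
    return (t * height + h) * width + w


def sequence_gaps(height, width):
    """Distance in the 1D sequence between a token and its neighbours."""
    base = flat_index(0, 0, 0, height, width)
    return {
        "right neighbour": flat_index(0, 0, 1, height, width) - base,
        "lower neighbour": flat_index(0, 1, 0, height, width) - base,
        "same position, next frame": flat_index(1, 0, 0, height, width) - base,
    }


# Hypothetical grid of 16 x 32 tokens per frame (assumed for illustration).
gaps = sequence_gaps(height=16, width=32)
for name, gap in gaps.items():
    print(f"{name}: {gap} positions away")
```

Under these assumptions, a token's horizontal neighbour is adjacent in the sequence, its vertical neighbour is a full row away, and the same location in the next frame is a full frame of tokens away. A model that only sees 1D positions has to learn these relationships implicitly, which is the gap that spatial-temporal fusion mechanisms are meant to address.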

Code & Models

Repositories

yvanyin/drivingworld (PyTorch, official)

Models

🤗 huxiaotaostasy/DrivingWorld
model · ♡ 2

Taxonomy

Topics: Big Data Technologies and Applications

Methods: Cosine Annealing · Linear Layer · Residual Connection · Weight Decay · Multi-Head Attention · Adam · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Discriminative Fine-Tuning