DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

TL;DR
DrivingGPT unifies driving world modeling and planning using multimodal autoregressive transformers, enabling improved video generation and trajectory planning by modeling images and actions jointly.
Contribution
The paper introduces DrivingGPT, a novel multimodal transformer that combines world modeling and planning into a single sequence prediction framework.
Findings
DrivingGPT outperforms strong baselines on the nuPlan and NAVSIM benchmarks.
Images and actions are modeled jointly as interleaved tokens in a single sequence.
One model supports both action-conditioned video generation and end-to-end trajectory planning.
Abstract
World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale…
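The interleaved "driving language" described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary sizes, the function names `interleave` and `next_token_pairs`, and the number of tokens per frame or action are all hypothetical, chosen only to show how image and action tokens might share one sequence trained with next-token prediction.

```python
from typing import List, Tuple

# Hypothetical vocabulary layout (an assumption, not from the paper):
# image token ids occupy [0, IMAGE_VOCAB); action token ids are shifted
# to sit directly after them so both modalities share one id space.
IMAGE_VOCAB = 8192
ACTION_VOCAB = 256


def interleave(frames: List[List[int]], actions: List[List[int]]) -> List[int]:
    """Flatten per-timestep image and action tokens into one sequence.

    Resulting order: image tokens of step 0, action tokens of step 0,
    image tokens of step 1, action tokens of step 1, and so on.
    """
    if len(frames) != len(actions):
        raise ValueError("expected one action token group per frame")
    sequence: List[int] = []
    for image_tokens, action_tokens in zip(frames, actions):
        if any(not 0 <= t < IMAGE_VOCAB for t in image_tokens):
            raise ValueError("image token id out of range")
        if any(not 0 <= a < ACTION_VOCAB for a in action_tokens):
            raise ValueError("action token id out of range")
        sequence.extend(image_tokens)
        # Offset action ids so they never collide with image ids.
        sequence.extend(IMAGE_VOCAB + a for a in action_tokens)
    return sequence


def next_token_pairs(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) for next-token prediction: targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


# Toy example: two timesteps, two image tokens and one action token each.
toy_sequence = interleave(frames=[[1, 2], [3, 4]], actions=[[0], [5]])
toy_inputs, toy_targets = next_token_pairs(toy_sequence)
```

Under this layout, predicting an image token given the preceding actions corresponds to world modeling, and predicting an action token given the preceding images corresponds to planning, so a single autoregressive objective covers both tasks.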
Taxonomy
Topics: Data Management and Algorithms · Automated Road and Building Extraction · Autonomous Vehicle Technology and Safety
Methods: Diffusion
