TL;DR
DriveDreamer-Policy is a unified, geometry-aware world-action model for autonomous driving that integrates depth, video prediction, and planning, achieving state-of-the-art results on Navsim benchmarks.
Contribution
It introduces a modular architecture combining depth generation, video prediction, and planning guided by a geometry-aware world representation.
Findings
Achieves 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2 benchmarks.
Outperforms existing world-model-based approaches in planning and world generation.
Explicit depth learning enhances video imagination and planning robustness.
Abstract
Recently, world-action models (WAMs) have emerged to bridge vision-language-action (VLA) models and world models, unifying the reasoning and instruction-following capabilities of the former with the spatio-temporal world modeling of the latter. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and…
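The data flow described in the abstract (one shared backbone followed by three lightweight generators, with depth guiding the later stages) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, shape, and arithmetic rule below is a hypothetical placeholder, the "backbone" is a simple average rather than a large language model, and whether the planner also conditions on the generated video is not stated in the truncated abstract, so the sketch feeds it only the context and depth.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class WorldActionOutput:
    depth: Vector               # placeholder depth estimate
    future_video: List[Vector]  # one placeholder "frame" per future step
    trajectory: List[Vector]    # planned (x, y) waypoints


def mean_pool(rows: Sequence[Vector]) -> Vector:
    """Column-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def encode_context(instruction: str, image_feats: Sequence[Vector],
                   past_actions: Sequence[Vector]) -> Vector:
    """Hypothetical stand-in for the LLM backbone: fuse all inputs."""
    pooled = mean_pool(list(image_feats) + list(past_actions))
    return pooled + [float(len(instruction.split()))]


def depth_generator(context: Vector) -> Vector:
    """Hypothetical depth head: maps context to strictly positive values."""
    return [abs(x) + 1.0 for x in context]


def video_generator(context: Vector, depth: Vector, horizon: int) -> List[Vector]:
    """Hypothetical video head, conditioned on context and depth."""
    return [[c + 0.1 * t * d for c, d in zip(context, depth)]
            for t in range(1, horizon + 1)]


def action_generator(context: Vector, depth: Vector, horizon: int) -> List[Vector]:
    """Hypothetical planner: step length capped by mean depth (toy rule)."""
    step = min(sum(depth) / len(depth), 2.0)
    return [[t * step, 0.0] for t in range(1, horizon + 1)]


def run_world_action_model(instruction: str, image_feats: Sequence[Vector],
                           past_actions: Sequence[Vector],
                           horizon: int = 3) -> WorldActionOutput:
    context = encode_context(instruction, image_feats, past_actions)
    depth = depth_generator(context)                  # geometry first
    video = video_generator(context, depth, horizon)  # depth guides prediction
    trajectory = action_generator(context, depth, horizon)
    return WorldActionOutput(depth, video, trajectory)


out = run_world_action_model(
    "turn left",
    image_feats=[[0.2, -0.4], [0.6, 0.0]],  # two toy per-view feature vectors
    past_actions=[[0.1, 0.0]],
)
print(len(out.depth), len(out.future_video), len(out.trajectory))
```

The only point of the sketch is the ordering: depth is produced first and then passed to the video and planning stages, mirroring the abstract's claim that a geometry-aware representation guides the downstream outputs.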
