TL;DR
DriveLaW is a unified framework that combines high-fidelity video prediction with reliable motion planning for autonomous driving, improving both tasks through a shared latent representation.
Contribution
The paper proposes DriveLaW, a paradigm that unifies video generation and motion planning through a shared latent space, trained with a three-stage strategy.
Findings
Surpasses prior state-of-the-art methods on the video prediction metrics FID and FVD.
Achieves a new record on the NAVSIM planning benchmark.
Demonstrates consistent and reliable trajectory planning from generated video latent representations.
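For readers unfamiliar with the video metrics named in the findings: FID and FVD are both Fréchet distances between Gaussian fits to feature embeddings of real and generated samples (image features for FID, video features for FVD). The sketch below shows only that shared distance computation on pre-extracted feature arrays. It is a minimal illustration, not the paper's evaluation pipeline; the function name `frechet_distance` and the random toy features are assumptions made for the example, and the feature extractor itself is omitted.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim). In a real FID/FVD
    evaluation the features would come from a pretrained network; here
    they are assumed to be given.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # positive semi-definite covariances. Clip tiny negative values that
    # arise from floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))       # toy stand-in for real-sample features
    generated = real + 3.0                 # same covariance, shifted mean
    print(frechet_distance(real, real))
    print(frechet_distance(real, generated))
```

Lower values mean the generated feature distribution is closer to the real one. In the toy usage, the two sets share a covariance, so the distance reduces to the squared norm of the mean shift.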
Abstract
World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates…
