DriveVA: Video Action Models are Zero-Shot Drivers
Mengmeng Liu, Diankun Zhang, Jiuming Liu, Jianfeng Cui, Hongwei Xie, Guang Chen, Hangjun Ye, Michael Ying Yang, Francesco Nex, Hao Cheng

TL;DR
DriveVA is a novel world model for autonomous driving that jointly predicts future videos and actions, demonstrating strong zero-shot generalization and improved planning consistency across diverse scenarios.
Contribution
It introduces a joint video-action decoding framework using a DiT-based decoder and a video continuation strategy to enhance generalization and trajectory consistency.
Findings
Achieves a PDM score of 90.9 on the NAVSIM challenge.
Reduces L2 error by 78.9% on nuScenes.
Reduces collision rate by 83.3% on nuScenes.
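For readers unfamiliar with the metrics above, here is a minimal sketch of how an average L2 trajectory error and a relative reduction can be computed. It is a generic illustration, not the paper's evaluation code: the function names `average_l2_error` and `relative_reduction` are hypothetical, the waypoints are made-up toy values, and benchmark protocols may average over horizons differently.

```python
import math


def average_l2_error(predicted, reference):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    return sum(distances) / len(distances)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Toy data (assumed for illustration only, not from the paper).
reference = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
baseline_plan = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
improved_plan = [(1.0, 0.25), (2.0, 0.25), (3.0, 0.25)]

baseline_error = average_l2_error(baseline_plan, reference)
improved_error = average_l2_error(improved_plan, reference)
reduction = relative_reduction(baseline_error, improved_error)
print(f"L2 error: {baseline_error:.2f} -> {improved_error:.2f} "
      f"({reduction:.1%} reduction)")
```

A percentage reduction of this kind is always relative to a baseline planner, so the reported figures are only meaningful alongside the method they are compared against.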
Abstract
Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal…
