DriveVA: Video Action Models are Zero-Shot Drivers

Mengmeng Liu; Diankun Zhang; Jiuming Liu; Jianfeng Cui; Hongwei Xie; Guang Chen; Hangjun Ye; Michael Ying Yang; Francesco Nex; Hao Cheng

arXiv:2604.04198 · cs.CV · April 7, 2026

TL;DR

DriveVA is a novel world model for autonomous driving that jointly predicts future videos and actions, demonstrating strong zero-shot generalization and improved planning consistency across diverse scenarios.

Contribution

The paper introduces a joint video-action decoding framework that uses a DiT-based decoder and a video continuation strategy to improve generalization and trajectory consistency.
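The idea of decoding video and actions in one shared process can be illustrated with a toy sketch. This is not the paper's architecture: `toy_denoiser`, `joint_decode`, the scalar "tokens", and the fixed step size are all hypothetical stand-ins, and a real DiT decoder would be a learned transformer over high-dimensional latents. The sketch only shows the structural point that both token groups sit in one sequence and are refined together, so each update to the action tokens depends on the video tokens and vice versa.

```python
import random


def toy_denoiser(tokens):
    """Hypothetical stand-in for a learned decoder.

    Predicts each token's offset from the mean of the whole sequence.
    Because the mean spans video and action tokens, every update
    couples the two groups.
    """
    mean = sum(tokens) / len(tokens)
    return [x - mean for x in tokens]


def joint_decode(num_video, num_action, steps=8, seed=0):
    """Refine video and action tokens together in one shared sequence."""
    rng = random.Random(seed)
    # Start from noise for the concatenated [video | action] sequence.
    tokens = [rng.gauss(0.0, 1.0) for _ in range(num_video + num_action)]
    for _ in range(steps):
        predicted = toy_denoiser(tokens)
        # Remove half of the predicted offset at each step (assumed step size).
        tokens = [x - 0.5 * p for x, p in zip(tokens, predicted)]
    # Split the shared sequence back into the two outputs.
    return tokens[:num_video], tokens[num_video:]


video_latents, actions = joint_decode(num_video=6, num_action=3)
print(len(video_latents), len(actions))  # prints "6 3"
```

A loosely coupled alternative would run two separate loops, one per token group, which is the kind of design the abstract associates with weaker video-trajectory consistency.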

Findings

01. Achieves a PDM score of 90.9 on the NAVSIM challenge.

02. Reduces L2 error by 78.9% on nuScenes.

03. Reduces collision rate by 83.3% on nuScenes.
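The percentage figures above are relative reductions, so they read as "the new value is this much lower than a baseline value". The summary does not state the baseline or the absolute values, so the helper `relative_reduction` and the numbers below are hypothetical and only illustrate the arithmetic.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the percentage by which `new` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Hypothetical values chosen only to show the calculation;
# they are not results from the paper.
baseline_l2 = 1.00
new_l2 = 0.25
print(f"{relative_reduction(baseline_l2, new_l2):.1f}% lower")  # prints "75.0% lower"
```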

Abstract

Generalization is a central challenge in autonomous driving, as real-world deployment requires robust performance under unseen scenarios, sensor domains, and environmental conditions. Recent world-model-based planning methods have shown strong capabilities in scene understanding and multi-modal future prediction, yet their generalization across datasets and sensor configurations remains limited. In addition, their loosely coupled planning paradigm often leads to poor video-trajectory consistency during visual imagination. To overcome these limitations, we propose DriveVA, a novel autonomous driving world model that jointly decodes future visual forecasts and action sequences in a shared latent generative process. DriveVA inherits rich priors on motion dynamics and physical plausibility from well-pretrained large-scale video generation models to capture continuous spatiotemporal…
