DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving

Feiyang Jia; Lin Liu; Ziying Song; Caiyan Jia; Hangjun Ye; Xiaoshuai Hao; Long Chen

arXiv:2602.06521 · cs.CV · February 9, 2026

TL;DR

DriveWorld-VLA introduces a unified latent-space framework for autonomous driving that integrates vision, language, and action modeling, improving scene prediction and decision-making while reducing reliance on dense annotated supervision.

Contribution

It proposes a novel architecture that tightly integrates world modeling and planning in a shared latent space, improving scene evolution understanding and action planning in autonomous driving.
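
The idea of a planner and a world model sharing one latent state can be illustrated with a toy sketch. This is not the paper's architecture: the class `SharedLatentAgent`, its methods, the linear-plus-tanh layers, the random untrained weights, and the candidate-scoring rollout are all hypothetical choices made only to show how imagined future latents can feed directly into action selection.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class SharedLatentAgent:
    """Toy agent (hypothetical): world model and planner read the same latent."""

    def __init__(self, obs_dim, latent_dim, action_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        # Untrained random weights; a real system would learn all three.
        self.w_enc = rand_matrix(latent_dim, obs_dim)                  # observation -> latent
        self.w_dyn = rand_matrix(latent_dim, latent_dim + action_dim)  # latent dynamics
        self.w_score = rand_matrix(1, latent_dim)                      # latent -> scalar value

    def encode(self, obs):
        """Map an observation vector into the shared latent space."""
        return [math.tanh(v) for v in matvec(self.w_enc, obs)]

    def imagine(self, latent, action):
        """World-model step: predict the next latent from latent and action."""
        return [math.tanh(v) for v in matvec(self.w_dyn, latent + action)]

    def score(self, latent):
        """Planner head: rate a latent state (higher is better)."""
        return matvec(self.w_score, latent)[0]

    def plan(self, obs, candidates, horizon=3):
        """Pick the candidate action whose imagined latent rollout scores best."""
        start = self.encode(obs)
        best_action, best_value = None, -math.inf
        for action in candidates:
            latent, value = start, 0.0
            for _ in range(horizon):
                latent = self.imagine(latent, action)  # imagined future latent
                value += self.score(latent)            # planner reads that same latent
            if value > best_value:
                best_action, best_value = action, value
        return best_action, best_value


agent = SharedLatentAgent(obs_dim=4, latent_dim=3, action_dim=2)
observation = [0.1, -0.2, 0.3, 0.0]
candidate_actions = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
chosen_action, chosen_value = agent.plan(observation, candidate_actions)
print(chosen_action, chosen_value)
```

In this sketch the planner never sees raw observations or decoded images; it scores the same latent vectors that the dynamics step produces, which is one simple reading of what "shared latent space" could mean.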

Findings

01 Achieves state-of-the-art performance on NAVSIMv1 and NAVSIMv2 datasets.

02 Reduces collision rates significantly on nuScenes.

03 Effectively models future scene evolution in latent space.

Abstract

End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications