UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Yongkang Li; Lijun Zhou; Sixu Yan; Bencheng Liao; Tianyi Yan; Kaixin Xiong; Long Chen; Hongwei Xie; Bing Wang; Guang Chen; Hangjun Ye; Wenyu Liu; Haiyang Sun; Xinggang Wang

arXiv:2604.02190·cs.CV·April 3, 2026

TL;DR

UniDriveVLA is a unified Vision-Language-Action model for autonomous driving that combines perception, understanding, and action planning in a Mixture-of-Transformers architecture, achieving state-of-the-art results on nuScenes and Bench2Drive.

Contribution

UniDriveVLA introduces a decoupled expert architecture and a progressive training strategy to balance spatial perception and semantic reasoning in driving models.

Findings

01. Achieves state-of-the-art performance on nuScenes and Bench2Drive datasets.

02. Demonstrates broad applicability across perception, prediction, and understanding tasks.

03. Effectively balances perception and reasoning through expert decoupling and progressive training.

Abstract

Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning…
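
The abstract attributes the perception-reasoning dilemma to coupled optimization within shared parameters and names Mixture-of-Transformers as the remedy. The sketch below illustrates only that general idea, under assumptions: each token is assigned to one expert that owns its own projection weights, while attention still runs jointly over the whole sequence. The expert names, the `mot_layer` and `make_expert` helpers, the dimensions, and the layer layout are all hypothetical and are not taken from the paper or its repository.

```python
import math
import random

# Hypothetical expert names, chosen to mirror the three roles in the title.
EXPERTS = ("understanding", "perception", "action")


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Create one expert with its own Q/K/V/output weights (nothing shared)."""
    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    return {"q": mat(), "k": mat(), "v": mat(), "out": mat()}


def mot_layer(tokens, token_experts, params):
    """Apply one illustrative layer.

    tokens: list of vectors; token_experts: expert name for each token.
    """
    dim = len(tokens[0])
    # Expert-specific projections: a token only touches its own expert's weights.
    q = [matvec(params[e]["q"], t) for t, e in zip(tokens, token_experts)]
    k = [matvec(params[e]["k"], t) for t, e in zip(tokens, token_experts)]
    v = [matvec(params[e]["v"], t) for t, e in zip(tokens, token_experts)]
    out = []
    for i, e in enumerate(token_experts):
        # Joint attention: every token attends over the full mixed sequence.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(dim) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(dim)]
        # Expert-specific output projection plus a residual connection.
        proj = matvec(params[e]["out"], mixed)
        out.append([t + p for t, p in zip(tokens[i], proj)])
    return out


rng = random.Random(0)
DIM = 4
params = {name: make_expert(DIM, rng) for name in EXPERTS}
token_experts = ["understanding", "understanding", "perception", "action"]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in token_experts]
output = mot_layer(tokens, token_experts, params)
```

In this toy layout, a gradient update to one expert's weights would leave the other experts' parameters untouched, while tokens still exchange information through the joint attention step. Whether UniDriveVLA splits its layers this way is not stated in the truncated abstract.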

Code & Models

Repositories

xiaomi-research/unidrivevla (GitHub)