Unifying Language-Action Understanding and Generation for Autonomous Driving

Xinyang Wang; Qian Liu; Wenjie Ding; Zhao Yang; Wei Li; Chang Liu; Bailin Li; Kun Zhan; Xianpeng Lang; Wei Chen

arXiv:2603.01441 · cs.CV · March 3, 2026

TL;DR

This paper presents LinkVLA, a unified model for autonomous driving that improves language-action alignment and inference efficiency through a shared discrete codebook for language and action tokens, an auxiliary action understanding objective, and coarse-to-fine (C2F) decoding.

Contribution

The paper introduces LinkVLA, an architecture that unifies language and action tokens in a shared discrete codebook, together with a two-step coarse-to-fine (C2F) generation approach that improves language-action alignment and inference speed in autonomous driving.

Findings

1. Improved instruction following accuracy in driving benchmarks.
2. Reduced inference time by 86% with C2F decoding.
3. Enhanced cross-modal consistency and semantic understanding.

Abstract

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the…
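To make the idea of a shared discrete codebook concrete, the following is a minimal sketch of one way action values could share an ID space with text tokens: trajectory coordinates are quantized into bins, and each bin is given a token ID placed after the text vocabulary. This is an illustration only, not the tokenizer described in the paper. The names `TEXT_VOCAB_SIZE`, `NUM_BINS`, `COORD_RANGE`, and all functions are hypothetical, and uniform binning is an assumed quantization scheme.

```python
from typing import List, Tuple

# All constants below are hypothetical values chosen for illustration.
TEXT_VOCAB_SIZE = 32000          # assumed size of the language vocabulary
NUM_BINS = 256                   # assumed number of action bins per coordinate
COORD_RANGE = (-50.0, 50.0)      # assumed coordinate range in metres


def coord_to_bin(value: float) -> int:
    """Clip a coordinate to COORD_RANGE and map it to a uniform bin index."""
    lo, hi = COORD_RANGE
    clipped = min(max(value, lo), hi)
    frac = (clipped - lo) / (hi - lo)
    return min(int(frac * NUM_BINS), NUM_BINS - 1)


def bin_to_coord(bin_index: int) -> float:
    """Return the centre of a bin as the reconstructed coordinate."""
    lo, hi = COORD_RANGE
    width = (hi - lo) / NUM_BINS
    return lo + (bin_index + 0.5) * width


def encode_waypoints(waypoints: List[Tuple[float, float]]) -> List[int]:
    """Turn (x, y) waypoints into token IDs that sit after the text vocabulary."""
    ids: List[int] = []
    for x, y in waypoints:
        ids.append(TEXT_VOCAB_SIZE + coord_to_bin(x))
        ids.append(TEXT_VOCAB_SIZE + coord_to_bin(y))
    return ids


def decode_waypoints(token_ids: List[int]) -> List[Tuple[float, float]]:
    """Invert encode_waypoints, up to quantization error."""
    bins = [t - TEXT_VOCAB_SIZE for t in token_ids]
    return [
        (bin_to_coord(bins[i]), bin_to_coord(bins[i + 1]))
        for i in range(0, len(bins), 2)
    ]


if __name__ == "__main__":
    trajectory = [(0.0, 0.0), (1.5, 0.2), (3.1, 0.7)]
    tokens = encode_waypoints(trajectory)
    print(tokens)
    print(decode_waypoints(tokens))
```

Under these assumptions, a single model can emit text tokens and action tokens from one output vocabulary of size `TEXT_VOCAB_SIZE + NUM_BINS`, and a decoded waypoint differs from the original by at most half a bin width per coordinate.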

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications