Unifying Language-Action Understanding and Generation for Autonomous Driving
Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, Wei Chen

TL;DR
This paper presents LinkVLA, a unified model for autonomous driving that improves language-action alignment and inference efficiency through a shared discrete codebook, bidirectional training, and fast coarse-to-fine decoding, yielding better performance and lower latency.
Contribution
The paper introduces LinkVLA, a novel architecture that unifies language and action tokens in a shared discrete codebook, together with a two-step coarse-to-fine generation approach. Both are aimed at improving language-action alignment and inference speed in autonomous driving.
Findings
Improved instruction-following accuracy in driving benchmarks.
Reduced inference time by 86% with coarse-to-fine (C2F) decoding.
Enhanced cross-modal consistency and semantic understanding.
Abstract
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the…
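The idea of placing language and action tokens in one discrete codebook can be illustrated with a minimal sketch. This is not the paper's tokenizer: the vocabulary size, bin count, coordinate range, uniform binning scheme, and every function name below are assumptions made purely for illustration. The sketch only shows how continuous waypoint coordinates could be mapped to IDs that sit directly after the text IDs, so a single model can read and emit both kinds of token.

```python
from typing import List, Tuple

# All constants are hypothetical; the summary above does not specify them.
TEXT_VOCAB_SIZE = 32000              # assumed size of the language vocabulary
NUM_ACTION_BINS = 256                # assumed number of bins per coordinate
COORD_MIN, COORD_MAX = -50.0, 50.0   # assumed coordinate range in metres


def coord_to_token(value: float) -> int:
    """Map one continuous coordinate to an ID in the shared vocabulary."""
    clipped = min(max(value, COORD_MIN), COORD_MAX)
    frac = (clipped - COORD_MIN) / (COORD_MAX - COORD_MIN)
    # The upper edge of the range falls into the last bin.
    bin_index = min(int(frac * NUM_ACTION_BINS), NUM_ACTION_BINS - 1)
    # Action IDs are offset so they never collide with text IDs.
    return TEXT_VOCAB_SIZE + bin_index


def is_action_token(token_id: int) -> bool:
    """True if the ID lies in the action range of the shared vocabulary."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + NUM_ACTION_BINS


def token_to_coord(token_id: int) -> float:
    """Map an action ID back to the centre of its bin."""
    if not is_action_token(token_id):
        raise ValueError("not an action token")
    width = (COORD_MAX - COORD_MIN) / NUM_ACTION_BINS
    return COORD_MIN + (token_id - TEXT_VOCAB_SIZE + 0.5) * width


def encode_waypoints(waypoints: List[Tuple[float, float]]) -> List[int]:
    """Flatten (x, y) waypoints into a sequence of shared-vocabulary IDs."""
    tokens: List[int] = []
    for x, y in waypoints:
        tokens.append(coord_to_token(x))
        tokens.append(coord_to_token(y))
    return tokens
```

Under these assumptions, a trajectory encoded with `encode_waypoints` can be concatenated with ordinary text token IDs in one sequence, and `token_to_coord` recovers each coordinate to within half a bin width.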
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
