Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Tianshuai Hu; Xiaolu Liu; Song Wang; Yiyao Zhu; Ao Liang; Lingdong Kong; Guoyang Zhao; Zeying Gong; Jun Cen; Zhiyu Huang; Xiaoshuai Hao; Linfeng Li; Hang Song; Xiangtai Li; Jun Ma; Shaojie Shen; Jianke Zhu; Dacheng Tao; Ziwei Liu; Junwei Liang

arXiv:2512.16760·cs.RO·January 6, 2026

Open Access

TL;DR

This paper reviews the evolution of Vision-Language-Action models in autonomous driving, highlighting their potential for more interpretable, robust, and human-aligned driving policies by integrating perception, reasoning, and language-grounded decision making.

Contribution

It provides a structured overview of VLA frameworks, categorizes existing methods, and discusses key challenges and future directions in autonomous driving research.

Findings

01. VLA models unify perception and decision-making, with the aim of better interpretability.

02. The survey distinguishes two main paradigms: End-to-End VLA and Dual-System VLA.

03. Open challenges include robustness, interpretability, and instruction fidelity.

Abstract

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)