DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models
Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Xianpeng Lang

TL;DR
DriveAction introduces a comprehensive benchmark for evaluating human-like driving decisions in vision-language-action models, emphasizing scenario diversity, real-world data, and action-based evaluation to improve autonomous driving systems.
Contribution
This work presents the first action-driven benchmark with real-world data, detailed annotations, and an evaluation framework tailored for VLA models in autonomous driving.
Findings
Vision-language models need both vision and language guidance for accurate action prediction.
Model accuracy drops by 3.3% without vision input and by 4.1% without language input.
The benchmark enables precise identification of model bottlenecks and supports advancing human-like driving decisions.
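The ablation comparison behind these findings can be illustrated with a minimal sketch. The setting names (`full`, `no_vision`, `no_language`), the record format, and the toy data are all hypothetical and not taken from the benchmark. The sketch also assumes the drop is measured in percentage points against the full-input setting, which the summary above does not state.

```python
from collections import defaultdict

def accuracy_by_mode(records):
    """Compute accuracy per input setting.

    records: iterable of (mode, is_correct) pairs, where `mode` is a
    hypothetical label such as "full", "no_vision", or "no_language".
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for mode, is_correct in records:
        totals[mode] += 1
        hits[mode] += int(is_correct)
    return {mode: hits[mode] / totals[mode] for mode in totals}

def drops_vs_full(acc, full_mode="full"):
    """Accuracy drop of each ablated setting relative to the full setting.

    Assumption: drops are reported in percentage points.
    """
    base = acc[full_mode]
    return {
        mode: round((base - value) * 100, 1)
        for mode, value in acc.items()
        if mode != full_mode
    }

# Toy, made-up results: five action questions answered under each setting.
toy_records = (
    [("full", c) for c in (True, True, True, True, False)]
    + [("no_vision", c) for c in (True, True, True, False, False)]
    + [("no_language", c) for c in (True, True, False, False, False)]
)

toy_accuracy = accuracy_by_mode(toy_records)
toy_drops = drops_vs_full(toy_accuracy)
```

Keeping one record per question and setting makes it straightforward to slice the same table by scenario or task type when looking for the bottlenecks mentioned above.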
Abstract
Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by drivers of autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from drivers' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that…
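One way to picture an action-rooted tree that links vision, language, and action tasks is sketched below. It is only an illustration under stated assumptions: the `TaskNode` class, the `collect` helper, the layer labels, and every task name are hypothetical, and the actual structure of the DriveAction framework may differ.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    name: str    # hypothetical task name
    layer: str   # "action", "language", or "vision"
    children: list = field(default_factory=list)

def collect(node, layers):
    """Return names of all descendants of `node` whose layer is in `layers`.

    Traversal is depth-first, so a parent is listed before its children.
    """
    found = []
    for child in node.children:
        if child.layer in layers:
            found.append(child.name)
        found.extend(collect(child, layers))
    return found

# Toy tree (made up): an action question at the root, with language tasks
# beneath it and vision tasks beneath those.
root = TaskNode("choose_maneuver", "action", [
    TaskNode("navigation_following", "language", [
        TaskNode("lane_detection", "vision"),
    ]),
    TaskNode("traffic_light_reasoning", "language", [
        TaskNode("traffic_light_state", "vision"),
    ]),
])

# Comprehensive assessment: every supporting task under the action root.
full_context = collect(root, {"language", "vision"})
# Task-specific assessment: only the vision tasks.
vision_only = collect(root, {"vision"})
```

Selecting descendants by layer is one simple way to support both comprehensive and task-specific assessment from the same tree.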
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper introduces DriveAction, a well-structured benchmark. 2. Dataset quality is high: the data are real-world and driver-contributed, and cover diverse scenarios.
1. I suggest that the authors include a discussion of recent studies on VLM-generated datasets for autonomous driving that are built on different foundations. For example, some works such as [1][2] generate data based on existing datasets like nuScenes or nuPlan, while others use internal datasets. Highlighting these distinctions would help the community better understand the overall differences and positioning of this work. [1] Y. Xu et al., “VLM-AD: End-to-End Autonomous Driving through Visio
1. Proposes DriveAction, the first benchmark explicitly designed for Vision-Language-Action (VLA) evaluation in autonomous driving, addressing missing links between vision, language, and action reasoning. 2. Action labels are collected directly from real-time driver operations, faithfully capturing human decision intent rather than post-hoc annotations. 3. The action-rooted, tree-structured framework enables interpretable, modular analysis across V-L-A components, offering fine-grained evaluat
1. While the benchmark is well-structured, its main finding (that models need both vision and language inputs) is intuitive and not conceptually groundbreaking. 2. Previous works like DriveLM (Sima et al., 2024) and Reason2Drive (Nie et al., 2024) already explore end-to-end reasoning or goal-driven evaluation, weakening the “first action-driven” claim. 3. Evaluation focuses on accuracy without deeper breakdowns (e.g., statistical variance, error typology, or causal reasoning analysis).
1. Collecting 16k QA pairs from 2,610 real-world driving scenarios contributed by professional drivers. 2. Using real-time driver actions as ground-truth labels to capture authentic human decision intent. 3. Proposing an action-rooted tree-structured evaluation framework that connects vision, language, and action layers.
VLA models are inherently action-centric, so the action dimension should play a more decisive role in evaluation. However, DriveAction primarily emphasizes open-loop QA assessments on Dynamic, Static, Navigation, and Efficiency tasks, rather than measuring closed-loop driving behavior that reflects real-time control and long-horizon decision consistency. I find this confusing. I think the authors need to discuss further the importance of this benchmark in the community
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety
