DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

Yuhan Hao; Zhengning Li; Lei Sun; Weilong Wang; Naixin Yi; Sheng Song; Caihong Qin; Mofan Zhou; Yifei Zhan; Xianpeng Lang

arXiv:2506.05667·cs.CV·September 29, 2025

Open Access · 3 Reviews

TL;DR

DriveAction introduces a comprehensive benchmark for evaluating human-like driving decisions in vision-language-action models, emphasizing scenario diversity, real-world data, and action-based evaluation to improve autonomous driving systems.

Contribution

This work presents the first action-driven benchmark with real-world data, detailed annotations, and an evaluation framework tailored for VLA models in autonomous driving.

Findings

01

Vision-language models need both vision and language guidance for accurate actions.

02

Model accuracy decreases by 3.3% without vision and 4.1% without language guidance.

03

The benchmark enables precise identification of model bottlenecks and supports advancing human-like driving decisions.

Abstract

Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by drivers of autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from drivers' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that…
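As a rough illustration of what an action-rooted, tree-structured evaluation could look like, the sketch below places an action question at the root and attaches vision and language sub-tasks as children, so that accuracy can be reported for the whole tree or for a single task. The class name `TaskNode`, the helper `report`, the task names, and the toy results are all hypothetical; this is not the benchmark's actual data format or scoring code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TaskNode:
    """One node of a hypothetical evaluation tree (illustrative only)."""
    name: str
    results: List[bool] = field(default_factory=list)   # per-question correctness
    children: List["TaskNode"] = field(default_factory=list)

    def accuracy(self) -> Optional[float]:
        # Fraction of correctly answered questions, or None if there are none.
        if not self.results:
            return None
        return sum(self.results) / len(self.results)


def report(node: TaskNode, depth: int = 0) -> List[Tuple[int, str, Optional[float]]]:
    """Walk the tree root-first and collect (depth, name, accuracy) rows."""
    rows = [(depth, node.name, node.accuracy())]
    for child in node.children:
        rows.extend(report(child, depth + 1))
    return rows


# Toy data (made up): an action question linked to one vision and one language task.
vision = TaskNode("vision: traffic light state", [True, True, False, True])
language = TaskNode("language: navigation instruction", [True, False])
action = TaskNode("action: high-level maneuver", [True, False, False, True],
                  children=[vision, language])

rows = report(action)
for depth, name, acc in rows:
    shown = "n/a" if acc is None else f"{acc:.2f}"
    print("  " * depth + f"{name}: {shown}")
```

Reading the rows at depth 0 gives the action-level score, while the deeper rows show which linked vision or language task may be the bottleneck; a task-specific assessment would call `report` on a single child instead of the root.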

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper introduces DriveAction, a well-structured benchmark. 2. Dataset quality is high: the data are real-world, driver-contributed, and cover diverse scenarios.

Weaknesses

1. I suggest that the authors include a discussion of recent studies on VLM-generated datasets for autonomous driving that are built on different foundations. For example, some works such as [1][2] generate data based on existing datasets like nuScenes or nuPlan, while others use internal datasets. Highlighting these distinctions would help the community better understand the overall differences and positioning of this work. [1] Y. Xu et al., “VLM-AD: End-to-End Autonomous Driving through Visio

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Proposes DriveAction, the first benchmark explicitly designed for Vision-Language-Action (VLA) evaluation in autonomous driving, addressing missing links between vision, language, and action reasoning. 2. Action labels are collected directly from real-time driver operations, faithfully capturing human decision intent rather than post-hoc annotations. 3. The action-rooted, tree-structured framework enables interpretable, modular analysis across V-L-A components, offering fine-grained evaluat

Weaknesses

1. While the benchmark is well-structured, its main finding (that models need both vision and language inputs) is intuitive and not conceptually groundbreaking. 2. Previous works like DriveLM (Sima et al., 2024) and Reason2Drive (Nie et al., 2024) already explore end-to-end reasoning or goal-driven evaluation, weakening the “first action-driven” claim. 3. Evaluation focuses on accuracy without deeper breakdowns (e.g., statistical variance, error typology, or causal reasoning analysis).

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. Collecting 16k QA pairs from 2,610 real-world driving scenarios contributed by professional drivers. 2. Using real-time driver actions as ground-truth labels to capture authentic human decision intent. 3. Proposing an action-rooted tree-structured evaluation framework that connects vision, language, and action layers.

Weaknesses

VLA models are inherently action-centric, so the action dimension should play a more decisive role in evaluation. However, DriveAction primarily emphasizes open-loop QA assessments on Dynamic, Static, Navigation, and Efficiency tasks, rather than measuring closed-loop driving behavior that reflects real-time control and long-horizon decision consistency. This choice is confusing. I think the authors need to discuss further the importance of this benchmark to the community

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety