Bench2Drive-VL: Benchmarks for Closed-Loop Autonomous Driving with Vision-Language Models

Xiaosong Jia; Yuqian Shao; Zhenjie Yang; Qifeng Li; Zhiyuan Zhang; Junchi Yan

arXiv:2604.01259·cs.RO·April 3, 2026

TL;DR

This paper introduces Bench2Drive-VL, a comprehensive closed-loop benchmark for vision-language models in autonomous driving, enabling more realistic evaluation of model performance in diverse driving scenarios.

Contribution

It extends existing benchmarks by providing a closed-loop evaluation environment with diverse, behavior-grounded questions, and a flexible framework for VLMs in autonomous driving.

Findings

01 Enables evaluation of VLMs under out-of-distribution driving scenarios.

02 Provides a unified interface for integrating VLMs into closed-loop driving simulation.

03 Open-sources code and datasets for community use.

Abstract

With the rise of vision-language models (VLMs), their application to autonomous driving (VLM4AD) has gained significant attention. Meanwhile, closed-loop evaluation has become widely recognized in autonomous driving as a more reliable validation method than open-loop evaluation, because it measures model performance under cumulative errors and out-of-distribution inputs. However, existing VLM4AD benchmarks evaluate a model's scene understanding only in open loop, i.e., via static question-answer (QA) datasets. This kind of evaluation fails to assess a VLM's performance in out-of-distribution states that rarely appear in human-collected datasets. To this end, we present Bench2Drive-VL, an extension of Bench2Drive that brings closed-loop evaluation to VLM-based driving, which introduces: (1) DriveCommenter, a closed-loop generator that automatically generates…

Code & Models

Datasets

Telkwevr/Bench2Drive-VL-base
dataset · 4.5k downloads
