Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie; Lingdong Kong; Yuhao Dong; Chonghao Sima; Wenwei Zhang; Qi Alfred Chen; Ziwei Liu; Liang Pan

arXiv:2501.04003 · cs.CV · January 8, 2025

TL;DR

This paper evaluates the reliability of Vision-Language Models (VLMs) for autonomous driving, revealing their limitations in visual grounding and robustness, and proposes improved evaluation metrics and future directions for safer deployment.

Contribution

Introduction of DriveBench, a comprehensive benchmark dataset and evaluation framework for assessing VLM reliability in autonomous driving scenarios, highlighting current limitations and proposing solutions.

Findings

01 VLMs often rely on textual cues rather than true visual grounding.

02 VLMs are sensitive to input corruptions, affecting performance.

03 Current evaluation metrics may conceal reliability issues.
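
One way to picture the first two findings is to score the same questions under different input settings and compare the results. The sketch below is only an illustration under stated assumptions: the setting names (`clean`, `text_only`), the exact-match scoring, the `grounding_gap` helper, and the toy records are all hypothetical and are not taken from DriveBench or from the paper's own metrics. If accuracy barely drops when the image is withheld, that suggests the model may be answering from textual cues rather than from the visual input.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Compute exact-match accuracy per input setting.

    `records` is an iterable of (setting, predicted, expected) tuples.
    The setting labels are hypothetical, chosen for this illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for setting, predicted, expected in records:
        total[setting] += 1
        if predicted.strip().lower() == expected.strip().lower():
            correct[setting] += 1
    return {setting: correct[setting] / total[setting] for setting in total}


def grounding_gap(acc, reference="clean", ablated="text_only"):
    """Accuracy lost when the visual input is removed (hypothetical helper).

    A gap near zero hints that answers do not depend much on the image.
    """
    return acc[reference] - acc[ablated]


# Toy records invented for this example; not real model outputs.
records = [
    ("clean", "A", "A"),
    ("clean", "B", "B"),
    ("clean", "C", "A"),
    ("clean", "A", "A"),
    ("text_only", "A", "A"),
    ("text_only", "B", "B"),
    ("text_only", "C", "A"),
    ("text_only", "B", "A"),
]

acc = accuracy_by_setting(records)
gap = grounding_gap(acc)
print(acc, gap)
```

Exact-match scoring is used here only because it is simple to check by hand; it applies to multiple-choice answers and would not capture the quality of open-ended explanations.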

Abstract

Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset…

Code & Models

Repositories

opendrivelab/drivelm
PyTorch · Official

Taxonomy

Topics: Transportation Planning and Optimization · Human-Automation Interaction and Safety · Vehicle emissions and performance