DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Xiaoyu Tian; Junru Gu; Bailin Li; Yicheng Liu; Yang Wang; Zhiyong Zhao; Kun Zhan; Peng Jia; Xianpeng Lang; Hang Zhao

arXiv:2402.12289 · cs.CV · June 26, 2024 · 22 cites

Open Access

TL;DR

DriveVLM introduces a vision-language model-based system for autonomous driving that enhances scene understanding and planning, combining reasoning modules with traditional pipelines for improved performance in complex urban scenarios.

Contribution

The paper presents DriveVLM, a novel autonomous driving system integrating vision-language models with reasoning modules, and DriveVLM-Dual, a hybrid system addressing VLM limitations for real-world deployment.

Findings

01 Effective in complex urban scenarios

02 Improves scene understanding and planning

03 Validated on real-world vehicle deployment

Abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the…

Taxonomy

Topics: Multimodal Machine Learning Applications