VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process

Cristian Gariboldi; Hayato Tokida; Ken Kinjo; Yuki Asada; Alexander Carballo

arXiv:2507.01284·cs.RO·July 3, 2025

TL;DR

This paper introduces VLAD, an autonomous driving framework that pairs a fine-tuned vision-language model with the VAD end-to-end planner. The VLM issues high-level navigational commands and natural-language explanations of driving decisions, and the combined system reduces collision rates by 31.82% on the nuScenes dataset.

Contribution

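The paper presents a VLM-augmented driving system that combines three elements: a VLM fine-tuned on custom question-answer datasets to improve spatial reasoning, high-level navigational command generation that guides the VAD planner, and natural-language explanations that make driving decisions interpretable.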

Findings

01  Reduces collision rates by 31.82% on the nuScenes dataset

02  Enhances spatial reasoning through custom question-answer training

03  Establishes a new benchmark for VLM-based autonomous driving

Abstract

Recent advancements in open-source Visual Language Models (VLMs) such as LLaVA, Qwen-VL, and Llama have catalyzed extensive research on their integration with diverse systems. The internet-scale general knowledge encapsulated within these models presents significant opportunities for enhancing autonomous driving perception, prediction, and planning capabilities. In this paper we propose VLAD, a vision-language autonomous driving model, which integrates a fine-tuned VLM with VAD, a state-of-the-art end-to-end system. We implement a specialized fine-tuning approach using custom question-answer datasets designed specifically to improve the spatial reasoning capabilities of the model. The enhanced VLM generates high-level navigational commands that VAD subsequently processes to guide vehicle operation. Additionally, our system produces interpretable natural language explanations of driving…
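The hand-off described in the abstract, where the VLM emits a high-level command that a downstream planner consumes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-way command vocabulary, the "<COMMAND>: <explanation>" reply format, and the names `Command`, `Decision`, `parse_vlm_reply`, and `one_hot` are hypothetical and are not taken from the paper or from the VAD codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Command(Enum):
    # Hypothetical command vocabulary; the source does not list the real one.
    TURN_LEFT = 0
    TURN_RIGHT = 1
    GO_STRAIGHT = 2


@dataclass
class Decision:
    command: Command   # discrete command handed to the planner
    explanation: str   # natural-language rationale kept for interpretability


def parse_vlm_reply(reply: str, default: Command = Command.GO_STRAIGHT) -> Decision:
    """Map a free-text VLM reply to a discrete command plus its explanation.

    Assumes the (hypothetical) reply format "<COMMAND>: <explanation>".
    Falls back to `default` when the command text is not recognised.
    """
    head, _, tail = reply.partition(":")
    key = head.strip().upper().replace(" ", "_")
    command = Command.__members__.get(key, default)
    return Decision(command=command, explanation=tail.strip())


def one_hot(command: Command) -> List[float]:
    """Encode the command as a one-hot vector, a common planner input format."""
    vec = [0.0] * len(Command)
    vec[command.value] = 1.0
    return vec


if __name__ == "__main__":
    decision = parse_vlm_reply("turn left: the lane ahead is blocked")
    print(decision.command.name, one_hot(decision.command), decision.explanation)
```

Keeping the explanation alongside the command in one record mirrors the two outputs the abstract attributes to the VLM. The fallback to a default command is a design choice of this sketch, intended to keep the planner input well-formed when the reply cannot be parsed.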

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications