VLMs Guided Interpretable Decision Making for Autonomous Driving
Xin Hu, Taotao Jing, Renran Tian, Zhengming Ding

TL;DR
This paper uses vision-language models to improve the interpretability and reliability of autonomous driving decision-making: VLM-generated linguistic descriptions enrich the visual data, and the two modalities are combined through multi-modal fusion.
Contribution
The work introduces a multi-modal architecture that leverages VLMs as semantic enhancers and a post-hoc refinement module, improving decision accuracy and interpretability in autonomous driving.
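To make the idea of fusing visual features with VLM-derived text features concrete, here is a minimal sketch of one generic fusion scheme (concatenation followed by a linear scoring layer). It is not the authors' architecture: the action set, the function name `fuse_and_decide`, the feature dimensions, and the weights are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical high-level action set; the paper's actual label space may differ.
ACTIONS = ["forward", "stop", "left", "right"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_decide(visual_feat, text_feat, weights, bias):
    """Concatenate a visual feature vector with a text feature vector
    (assumed to come from an embedding of a VLM scene description),
    score each action with a linear layer, and return the top action.

    All inputs are plain lists of floats; `weights` holds one row per action.
    """
    fused = list(visual_feat) + list(text_feat)  # fusion by concatenation
    if len(weights) != len(ACTIONS) or len(bias) != len(ACTIONS):
        raise ValueError("need one weight row and one bias per action")
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best], probs


# Toy example with made-up numbers: 2 visual dims + 2 text dims.
toy_weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
toy_bias = [0.0, 0.0, 0.0, 0.0]
action, probs = fuse_and_decide([1.0, 0.0], [0.0, 1.0], toy_weights, toy_bias)
```

In this toy setup the text feature carries the largest weight, so the description side dominates the decision. A real system would learn the weights and would likely use a richer fusion mechanism than concatenation.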
Findings
Achieves state-of-the-art performance on autonomous driving benchmarks.
Enables more interpretable decision explanations.
Improves decision reliability through semantic scene enrichment.
Abstract
Recent advancements in autonomous driving (AD) have explored the use of vision-language models (VLMs) within visual question answering (VQA) frameworks for direct driving decision-making. However, these approaches often depend on handcrafted prompts and suffer from inconsistent performance, limiting their robustness and generalization in real-world scenarios. In this work, we evaluate state-of-the-art open-source VLMs on high-level decision-making tasks using ego-view visual inputs and identify critical limitations in their ability to deliver reliable, context-aware decisions. Motivated by these observations, we propose a new approach that shifts the role of VLMs from direct decision generators to semantic enhancers. Specifically, we leverage their strong general scene understanding to enrich existing vision-based benchmarks with structured, linguistically rich scene descriptions.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
