Spatial-aware Vision Language Model for Autonomous Driving
Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

TL;DR
This paper introduces LVLDrive, a framework that enhances vision-language models with 3D spatial understanding using LiDAR data, significantly improving autonomous driving scene comprehension and decision-making safety.
Contribution
The paper proposes a novel method to incorporate LiDAR data into pre-trained VLMs via a Gradual Fusion Q-Former, enabling robust 3D spatial reasoning for autonomous driving.
Findings
LVLDrive outperforms vision-only models on driving benchmarks.
The approach improves metric spatial perception accuracy.
The model enhances scene understanding and decision reliability.
Abstract
While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
