Spatial-aware Vision Language Model for Autonomous Driving

Weijie Wei; Zhipeng Luo; Ling Feng; Venice Erin Liong

arXiv:2512.24331 · cs.CV · January 1, 2026

Open Access

TL;DR

This paper introduces LVLDrive, a framework that enhances vision-language models with 3D spatial understanding using LiDAR data, significantly improving autonomous driving scene comprehension and decision-making safety.

Contribution

The paper proposes a novel method to incorporate LiDAR data into pre-trained VLMs via a Gradual Fusion Q-Former, enabling robust 3D spatial reasoning for autonomous driving.

Findings

01. LVLDrive outperforms vision-only models on driving benchmarks.

02. The approach improves metric spatial perception accuracy.

03. The model enhances scene understanding and decision reliability.

Abstract

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects…
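The abstract is cut off before it describes how the Gradual Fusion Q-Former works, so the following is not the authors' method. It is a minimal sketch of one generic way to add a new modality without disturbing a pre-trained model at the start of training: a residual connection scaled by a gate that starts at zero. The function names, the tanh gate, and the linear schedule are all assumptions chosen for illustration; in practice the gate could equally be a learned parameter.

```python
import math


def gated_fuse(visual, lidar, alpha):
    """Add LiDAR features to visual features through a scalar gate.

    tanh(alpha) is exactly 0 when alpha == 0, so the pre-trained visual
    features pass through unchanged until the gate is opened.
    (Illustrative only; not the paper's Gradual Fusion Q-Former.)
    """
    if len(visual) != len(lidar):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(alpha)
    return [v + gate * l for v, l in zip(visual, lidar)]


def linear_gate_schedule(step, total_steps, alpha_max=1.0):
    """Hypothetical schedule: ramp alpha linearly from 0 to alpha_max."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * frac


# Toy stand-ins for pre-trained image features and projected LiDAR features.
visual_tokens = [0.5, -1.0, 2.0]
lidar_tokens = [1.0, 1.0, -1.0]

for step in (0, 50, 100):
    alpha = linear_gate_schedule(step, total_steps=100)
    print(step, gated_fuse(visual_tokens, lidar_tokens, alpha))
```

At step 0 the fused features equal the visual features, which is the property that keeps the pre-trained VLM's behaviour intact before any LiDAR signal is mixed in.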

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning