VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen

TL;DR
VGGDrive introduces a novel architecture that integrates cross-view 3D geometric grounding into vision-language models, significantly improving autonomous driving tasks by leveraging mature 3D foundation models.
Contribution
The paper proposes VGGDrive, a new architecture with a plug-and-play module that injects 3D geometric features into VLMs for autonomous driving, addressing a key capability gap.
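To make the idea of injecting geometric features concrete, here is a minimal sketch of one generic way to do it: each VLM token attends over the tokens of a frozen 3D model and adds the result back through a gated residual. This is not the paper's module, whose design is not described in this summary. The function name `inject_geometry`, the `gate` parameter, and the omission of learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def inject_geometry(vlm_tokens, geo_tokens, gate=0.5):
    """Hypothetical injection step (an assumption, not the paper's design).

    vlm_tokens: list of VLM token vectors, used as queries.
    geo_tokens: list of token vectors from a frozen 3D model, used as
                keys and values; same dimensionality as vlm_tokens.
    gate:       scalar weight on the injected geometric context.
    """
    dim = len(vlm_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in vlm_tokens:
        # Scaled dot-product attention of one VLM token over all 3D tokens.
        weights = softmax([dot(q, k) * scale for k in geo_tokens])
        context = [
            sum(w * k[i] for w, k in zip(weights, geo_tokens))
            for i in range(dim)
        ]
        # Gated residual: the original token plus weighted geometric context.
        fused.append([x + gate * c for x, c in zip(q, context)])
    return fused
```

With `gate=0.0` the function returns the VLM tokens unchanged, which is the sense in which a module of this kind can be plug-and-play: the base model's behavior is recoverable when the injection is switched off.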
Findings
Enhances VLM performance on five autonomous driving benchmarks
Improves cross-view risk perception, motion prediction, and trajectory planning
Demonstrates the effectiveness of integrating 3D geometric grounding into VLMs
Abstract
The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
