VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

Jie Wang; Guang Li; Zhijian Huang; Chenxu Dang; Hangjun Ye; Yahong Han; Long Chen

arXiv:2602.20794 · cs.CV · February 25, 2026

Open Access

TL;DR

VGGDrive is an architecture that injects the cross-view 3D geometric grounding of a frozen, mature 3D foundation model into vision-language models, improving their performance on autonomous driving tasks.

Contribution

The paper proposes VGGDrive, a new architecture with a plug-and-play module that injects 3D geometric features into VLMs for autonomous driving, addressing their inherent lack of cross-view 3D geometric modeling.

Findings

01 Enhances VLM performance on five autonomous driving benchmarks
02 Improves cross-view risk perception, motion prediction, and trajectory planning
03 Demonstrates the effectiveness of integrating 3D geometric grounding into VLMs

Abstract

The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization