VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception

Fuhao Chang; Shuxin Li; Yabei Li; Lei He

arXiv:2508.09061 · cs.CV · August 13, 2025

Open Access

TL;DR

VLM-3D introduces an end-to-end vision-language model for open-world 3D perception in autonomous driving, integrating semantic and geometric understanding to improve accuracy and safety in complex environments.

Contribution

VLM-3D is presented as the first end-to-end framework that enables VLMs to perform 3D perception in autonomous driving. It uses Low-Rank Adaptation (LoRA) to adapt the VLM to driving tasks and a joint semantic-geometric loss to improve accuracy.

Findings

01. A 12.8% improvement in perception accuracy on the nuScenes dataset

02. Effective integration of semantic and geometric cues in 3D perception

03. Validation of the end-to-end VLM approach for autonomous driving scenarios

Abstract

Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint…
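
The abstract names LoRA as the adaptation mechanism but this listing does not give the paper's configuration. As a rough illustration of the general LoRA idea only (not VLM-3D's implementation), the sketch below keeps a weight matrix frozen and adds a scaled low-rank update on top of it. The function names, the row-vector convention, the matrix shapes, and the `alpha` scaling value are all assumptions made for this example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, down, up, alpha):
    """Return x @ W + (alpha / r) * (x @ down @ up).

    Assumed shapes (illustrative): x is n x d_in, w_frozen is d_in x d_out,
    down is d_in x r, up is r x d_out. Only down and up would be trained;
    w_frozen stays fixed.
    """
    rank = len(up)  # r is the number of rows of the up-projection
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, down), up)
    return [
        [b + scale * u for b, u in zip(base_row, update_row)]
        for base_row, update_row in zip(base, update)
    ]


def lora_param_count(d_in, d_out, rank):
    """Number of trainable parameters in the two low-rank factors."""
    return rank * (d_in + d_out)


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
x = [[1.0, 2.0]]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0], [1.0]]

# With a zero up-projection (a common initialisation) the adapted layer
# reproduces the frozen layer exactly.
unchanged = lora_forward(x, w_frozen, down, [[0.0, 0.0]], alpha=2.0)

# A non-zero up-projection shifts the output by the scaled low-rank term.
adapted = lora_forward(x, w_frozen, down, [[1.0, -1.0]], alpha=2.0)
```

The point of the factorisation is the parameter count: for a hypothetical 4096 x 4096 layer, `lora_param_count(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, which is why the approach is usually described as cheap to train.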

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis