VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception
Fuhao Chang, Shuxin Li, Yabei Li, Lei He

TL;DR
VLM-3D introduces an end-to-end vision-language model for open-world 3D perception in autonomous driving, integrating semantic and geometric understanding to improve accuracy and safety in complex environments.
Contribution
VLM-3D is the first end-to-end framework that enables VLMs to perform 3D perception in autonomous driving. It uses LoRA adaptation and a joint semantic-geometric loss to improve accuracy.
Findings
12.8% improvement in perception accuracy on the nuScenes dataset
Effective integration of semantic and geometric cues in 3D perception
Validation of end-to-end VLM approach for autonomous driving scenarios
Abstract
Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint…
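The abstract names Low-Rank Adaptation (LoRA) but does not give the authors' configuration. As a rough illustration of the general technique only, the sketch below adds a trainable low-rank update to a frozen linear layer. The class name `LoRALinear`, the rank, the scaling factor `alpha`, and the layer sizes are all hypothetical choices for this example, not values taken from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # Trainable down-projection, small random initialisation.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        # Trainable up-projection, zero-initialised so training starts
        # from the unmodified pretrained behaviour.
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base path uses the frozen weight; the low-rank path adds
        # scale * (B @ A) without ever materialising the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


# Hypothetical sizes chosen only to keep the example small.
d_in, d_out = 64, 32
W = np.random.default_rng(1).normal(size=(d_out, d_in))
layer = LoRALinear(W, rank=4)
x = np.ones((2, d_in))
y = layer.forward(x)
```

With these example sizes, only the two small matrices `A` and `B` would be trained (384 values) while the 2048-entry weight stays frozen, which is the source of the low adaptation cost that LoRA is generally used for.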
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
