VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
Runsen Xu, Zhiwei Huang, Tai Wang, Yilun Chen, Jiangmiao Pang, Dahua Lin

TL;DR
VLM-Grounder introduces a zero-shot 3D visual grounding framework that leverages vision-language models and 2D images, outperforming previous methods without requiring 3D data or object priors.
Contribution
The paper presents a novel VLM-based framework for zero-shot 3D visual grounding using only 2D images, enhancing accuracy and flexibility over existing object-centric approaches.
Findings
Achieves 51.6% accuracy on ScanRefer at 0.25 IoU
Attains 48.0% accuracy on the Nr3D dataset
Outperforms previous zero-shot methods in 3D grounding
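The "accuracy at 0.25 IoU" figure above counts a prediction as correct when its 3D box overlaps the ground-truth box with an intersection-over-union of at least 0.25. A minimal sketch of that metric follows, assuming axis-aligned boxes stored as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the names `iou_3d` and `accuracy_at_iou` are illustrative and not taken from the paper's code.

```python
from typing import Sequence, Tuple

# Assumed box layout (illustrative): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(b: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes count as zero."""
    return (max(0.0, b[3] - b[0])
            * max(0.0, b[4] - b[1])
            * max(0.0, b[5] - b[2]))


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])          # start of the overlap on this axis
        hi = min(a[i + 3], b[i + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                # no overlap on one axis means no overlap
        inter *= hi - lo
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Because the threshold is fairly loose, a predicted box can be noticeably offset from the ground truth and still count as a hit, which is worth keeping in mind when reading the numbers above.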
Abstract
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods depending on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6%…
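The abstract does not spell out how the multi-view projection works, so the following is only a generic sketch of the underlying idea: lift the target's 2D mask pixels from several views into a shared 3D frame and fit one box around them. It assumes a pinhole camera, per-pixel depth, and a 4x4 camera-to-world pose for each view, none of which are stated in the abstract; `backproject` and `ensemble_box` are hypothetical names, and the simple min/max box stands in for whatever filtering the authors actually use.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(mask_pixels: Sequence[Tuple[int, int]],
                depth: Sequence[Sequence[float]],
                intrinsics: Tuple[float, float, float, float],
                cam_to_world: Sequence[Sequence[float]]) -> List[Point]:
    """Lift (u, v) mask pixels to world-space points (assumed pinhole model)."""
    fx, fy, cx, cy = intrinsics
    points: List[Point] = []
    for u, v in mask_pixels:
        d = depth[v][u]               # depth is indexed as [row][column]
        if d <= 0:
            continue                  # skip pixels with no valid depth
        # Homogeneous point in the camera frame
        pc = ((u - cx) * d / fx, (v - cy) * d / fy, d, 1.0)
        # Apply the first three rows of the 4x4 camera-to-world matrix
        points.append(tuple(
            sum(cam_to_world[r][k] * pc[k] for k in range(4))
            for r in range(3)
        ))
    return points


def ensemble_box(views: Sequence[tuple]) -> Optional[Tuple[float, ...]]:
    """Merge points from all views and fit one axis-aligned box.

    Each view is (mask_pixels, depth, intrinsics, cam_to_world).
    Returns (xmin, ymin, zmin, xmax, ymax, zmax), or None if no points survive.
    """
    pts = [p for view in views for p in backproject(*view)]
    if not pts:
        return None
    mins = tuple(min(p[i] for p in pts) for i in range(3))
    maxs = tuple(max(p[i] for p in pts) for i in range(3))
    return mins + maxs
```

A plain min/max box is sensitive to stray pixels from imperfect masks, so a practical version would likely drop outlier points before fitting the box.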
Peer Reviews
Decision · CoRL 2024
Taxonomy
Topics: Advanced Optical Sensing Technologies · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
