VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding

Runsen Xu; Zhiwei Huang; Tai Wang; Yilun Chen; Jiangmiao Pang; Dahua Lin

arXiv:2410.13860·cs.CV·October 18, 2024

TL;DR

VLM-Grounder is a zero-shot 3D visual grounding framework that uses vision-language models and 2D images only. It outperforms previous zero-shot methods without requiring 3D data or object priors.

Contribution

The paper presents a novel VLM-based framework for zero-shot 3D visual grounding using only 2D images, enhancing accuracy and flexibility over existing object-centric approaches.

Findings

01 Achieves 51.6% accuracy on ScanRefer at an IoU threshold of 0.25

02 Attains 48.0% accuracy on the Nr3D dataset

03 Outperforms previous zero-shot methods in 3D visual grounding
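
The "accuracy at 0.25 IoU" figure above counts a prediction as correct when its 3D bounding box overlaps the ground-truth box with an intersection-over-union of at least 0.25. The following is a minimal sketch of that metric, assuming axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The function names `iou_3d` and `accuracy_at_iou` are illustrative and are not taken from the paper or its repository.

```python
from typing import Sequence

# Assumed box layout (not from the paper): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Sequence[float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            # No overlap along this axis, so the boxes do not intersect.
            return 0.0
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box],
                    threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For example, two 2 x 2 x 2 boxes offset by 1 along x share a volume of 4 out of a union of 12, giving an IoU of 1/3, which passes the 0.25 threshold.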

Abstract

3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding. Traditional methods depending on supervised learning with 3D point clouds are limited by scarce datasets. Recently, zero-shot methods leveraging LLMs have been proposed to address the data issue. While effective, these methods only use object-centric information, limiting their ability to handle complex queries. In this work, we present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images. VLM-Grounder dynamically stitches image sequences, employs a grounding and feedback scheme to find the target object, and uses a multi-view ensemble projection to accurately estimate 3D bounding boxes. Experiments on ScanRefer and Nr3D datasets show VLM-Grounder outperforms previous zero-shot methods, achieving 51.6%…

Peer Reviews

Decision·CoRL 2024

Reviewer 01 · Rating 2 · Confidence 4

Reviewer 02 · Rating 3 · Confidence 5

Reviewer 03 · Rating 2 · Confidence 3

Code & Models

Repositories

openrobotlab/vlm-grounder · PyTorch · Official

Taxonomy

Topics: Advanced Optical Sensing Technologies · Advanced Vision and Imaging · Robotics and Sensor-Based Localization