SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li; Shijie Li; Lingdong Kong; Xulei Yang; Junwei Liang

arXiv:2412.04383 · cs.CV · May 30, 2025

TL;DR

SeeGround introduces a zero-shot 3D visual grounding framework that leverages 2D vision-language models and hybrid scene representations to improve object localization without requiring annotated 3D datasets.

Contribution

The paper proposes a zero-shot 3D visual grounding method that pairs 2D VLMs trained on large-scale 2D data with a hybrid scene representation, and reports higher accuracy than existing zero-shot and weakly supervised methods.

Findings

01 Outperforms existing zero-shot methods by large margins
02 Exceeds weakly supervised methods in 3D visual grounding
03 Rivals some fully supervised approaches

Abstract

3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on textual descriptions, essential for applications like augmented reality and robotics. Traditional 3DVG approaches rely on annotated 3D datasets and predefined object categories, limiting scalability and adaptability. To overcome these limitations, we introduce SeeGround, a zero-shot 3DVG framework leveraging 2D Vision-Language Models (VLMs) trained on large-scale 2D data. SeeGround represents 3D scenes as a hybrid of query-aligned rendered images and spatially enriched text descriptions, bridging the gap between 3D data and 2D-VLMs input formats. We propose two modules: the Perspective Adaptation Module, which dynamically selects viewpoints for query-relevant image rendering, and the Fusion Alignment Module, which integrates 2D images with 3D spatial descriptions to enhance object localization. Extensive experiments…
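The hybrid representation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `SceneObject` type, the `describe_objects` and `select_viewpoint` helpers, and the `back_off` and `lift` parameters are all hypothetical names, and the viewpoint rule (stand behind the scene centre, raised slightly, looking at the object named in the query) is an assumption chosen only to show what "query-aligned" viewpoint selection and "spatially enriched text" might look like in practice.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical container for one detected 3D object."""
    obj_id: int
    label: str
    center: tuple  # (x, y, z), assumed to be in metres
    size: tuple    # (dx, dy, dz) extents of the bounding box


def describe_objects(objects):
    """Render one text line per object: id, label, 3D position, and size."""
    lines = []
    for o in objects:
        cx, cy, cz = o.center
        dx, dy, dz = o.size
        lines.append(
            f"<{o.obj_id}> {o.label}: "
            f"center=({cx:.2f}, {cy:.2f}, {cz:.2f}), "
            f"size=({dx:.2f}, {dy:.2f}, {dz:.2f})"
        )
    return "\n".join(lines)


def select_viewpoint(objects, anchor_label, back_off=1.0, lift=0.5):
    """Assumed rule: place the camera behind the scene centre, raised by
    `lift`, looking at the first object whose label matches the anchor.
    Falls back to looking at the scene centre if no object matches."""
    n = len(objects)
    centre = tuple(sum(o.center[i] for o in objects) / n for i in range(3))
    anchor = next((o for o in objects if o.label == anchor_label), None)
    target = anchor.center if anchor is not None else centre

    # Horizontal direction from the scene centre toward the target.
    dx, dy = target[0] - centre[0], target[1] - centre[1]
    norm = math.hypot(dx, dy)
    direction = (dx / norm, dy / norm) if norm > 1e-9 else (1.0, 0.0)

    eye = (
        centre[0] - back_off * direction[0],
        centre[1] - back_off * direction[1],
        centre[2] + lift,
    )
    return eye, target


scene = [
    SceneObject(0, "table", (0.0, 0.0, 0.0), (1.2, 0.8, 0.7)),
    SceneObject(1, "chair", (2.0, 0.0, 0.0), (0.5, 0.5, 0.9)),
]
eye, target = select_viewpoint(scene, anchor_label="chair")
prompt_text = describe_objects(scene)
```

In a full pipeline, `eye` and `target` would drive a renderer to produce the image, and `prompt_text` would accompany that image in the prompt given to the 2D VLM; both steps are omitted here because the abstract does not specify them.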

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization