Zero-Shot 3D Visual Grounding from Vision-Language Models
Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang

TL;DR
This paper introduces SeeGround, a zero-shot 3D visual grounding framework that leverages 2D vision-language models to locate objects in 3D scenes without requiring 3D-specific training, enabling scalable and generalizable applications.
Contribution
The paper proposes a novel zero-shot 3D visual grounding method using vision-language models with a hybrid input format and specialized modules for viewpoint adaptation and signal fusion.
Findings
Outperforms existing zero-shot methods by 7.7% on ScanRefer and 7.1% on Nr3D.
Rivals fully supervised approaches in 3D visual grounding.
Demonstrates strong generalization in open-world scenarios.
Abstract
3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D…
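The hybrid input described in the abstract (a query-aligned view paired with a spatially enriched text description) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `SceneObject` type, the `<id> label at (x, y, z)` text layout, the label-substring matching, and the camera placement rule are all hypothetical, and no rendering or VLM call is performed.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class SceneObject:
    """Hypothetical record for one detected object in the scene."""
    obj_id: int
    label: str
    center: Point  # (x, y, z) in the scene frame


def describe_objects(objects: List[SceneObject]) -> str:
    """Write one text line per object: id, label, and 3D centre (assumed layout)."""
    return "\n".join(
        f"<{o.obj_id}> {o.label} at "
        f"({o.center[0]:.2f}, {o.center[1]:.2f}, {o.center[2]:.2f})"
        for o in objects
    )


def mentioned_objects(query: str, objects: List[SceneObject]) -> List[SceneObject]:
    """Crude stand-in for query parsing: keep objects whose label occurs in the query."""
    q = query.lower()
    return [o for o in objects if o.label.lower() in q]


def centroid(points: List[Point]) -> Point:
    """Arithmetic mean of a non-empty list of 3D points."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def camera_pose(
    focus: Point, scene_center: Point, distance: float = 2.0, height: float = 1.5
) -> Tuple[Point, Point]:
    """Return (eye, look_at): step from the focus toward the scene centre in the
    horizontal plane, raise the camera, and look back at the focus."""
    dx, dy = scene_center[0] - focus[0], scene_center[1] - focus[1]
    norm = math.hypot(dx, dy)
    if norm < 1e-9:  # focus is directly above/below the centre: pick +x arbitrarily
        dx, dy, norm = 1.0, 0.0, 1.0
    eye = (
        focus[0] + distance * dx / norm,
        focus[1] + distance * dy / norm,
        focus[2] + height,
    )
    return eye, focus
```

In this sketch, the text returned by `describe_objects` and an image rendered from the pose returned by `camera_pose` would together form the prompt given to a 2D VLM; how SeeGround actually selects viewpoints and fuses the two signals is not specified in this summary.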
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
