Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang

TL;DR
This paper introduces "Think, Act, Build (TAB)", a dynamic agentic framework that combines vision-language models with multi-view geometry for zero-shot 3D visual grounding, bypassing static proposal matching.
Contribution
It proposes a novel generative 2D-to-3D reconstruction paradigm using open-source models, with a Semantic-Anchored Geometric Expansion mechanism for improved multi-view spatial understanding.
Findings
TAB outperforms previous zero-shot methods on the ScanRefer and Nr3D datasets.
It surpasses some fully supervised baselines in 3D visual grounding accuracy.
It decouples 3D grounding from static proposal matching.
Abstract
3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the…
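The geometric half of this decoupling can be illustrated with a minimal sketch. It assumes a 2D model has already produced a per-frame binary mask for the target object, and it uses only standard pinhole back-projection to lift masked RGB-D pixels into world coordinates and fit an axis-aligned box. The function names `backproject_mask` and `axis_aligned_box`, the `(fx, fy, cx, cy)` intrinsics tuple, and the 4x4 camera-to-world pose layout are hypothetical choices for illustration; this is not the paper's tracking or Semantic-Anchored Geometric Expansion procedure.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    intrinsics: Tuple[float, float, float, float],
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift the masked pixels of one RGB-D frame into world coordinates.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy), metric depth,
    and a 4x4 row-major camera-to-world pose matrix.
    """
    fx, fy, cx, cy = intrinsics
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (d, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or d <= 0:
                continue  # skip background pixels and invalid depth readings
            # Pinhole back-projection into the camera frame (homogeneous).
            cam = ((u - cx) * d / fx, (v - cy) * d / fy, d, 1.0)
            # Rigid transform into the shared world frame.
            x, y, z = (
                sum(cam_to_world[r][c] * cam[c] for c in range(4))
                for r in range(3)
            )
            points.append((x, y, z))
    return points


def axis_aligned_box(points: Sequence[Point]) -> Tuple[Point, Point]:
    """Return the (min corner, max corner) of the points' bounding box."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    shifted = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    depth = [[2.0, 2.0], [2.0, 2.0]]
    mask = [[1, 0], [0, 1]]
    intrinsics = (1.0, 1.0, 0.0, 0.0)

    # Merge the lifted points from two views, then fit one box.
    merged = backproject_mask(depth, mask, intrinsics, identity)
    merged += backproject_mask(depth, mask, intrinsics, shifted)
    print(axis_aligned_box(merged))
```

Merging points from several views before fitting the box is what lets geometry, rather than a precomputed proposal set, determine the object's 3D extent; in practice one would likely also filter depth outliers before taking the minimum and maximum.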
