Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding

Haibo Wang; Zihao Lin; Zhiyang Xu; Lifu Huang

arXiv:2604.00528·cs.CV·April 3, 2026

TL;DR

This paper introduces 'Think, Act, Build', a dynamic agentic framework that leverages vision-language models and multi-view geometry for zero-shot 3D visual grounding, bypassing static proposal matching.

Contribution

It proposes a novel generative 2D-to-3D reconstruction paradigm using open-source models, with a Semantic-Anchored Geometric Expansion mechanism for improved multi-view spatial understanding.

Findings

01 Outperforms previous zero-shot methods on the ScanRefer and Nr3D datasets.

02 Surpasses some fully supervised baselines in 3D visual grounding accuracy.

03 Introduces a new framework that decouples 3D grounding from static proposal matching.

Abstract

3D Visual Grounding (3D-VG) aims to localize objects in 3D scenes via natural language descriptions. While recent advancements leveraging Vision-Language Models (VLMs) have explored zero-shot possibilities, they typically suffer from a static workflow relying on preprocessed 3D point clouds, essentially degrading grounding into proposal matching. To bypass this reliance, our core motivation is to decouple the task: leveraging 2D VLMs to resolve complex spatial semantics, while relying on deterministic multi-view geometry to instantiate the 3D structure. Driven by this insight, we propose "Think, Act, Build (TAB)", a dynamic agentic framework that reformulates 3D-VG tasks as a generative 2D-to-3D reconstruction paradigm operating directly on raw RGB-D streams. Specifically, guided by a specialized 3D-VG skill, our VLM agent dynamically invokes visual tools to track and reconstruct the…
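
The abstract's split between 2D semantics and deterministic geometry can be illustrated with a small sketch. It assumes a pinhole camera model, a per-frame binary mask for the target object (for instance from a 2D tool the agent invokes), and a known camera-to-world pose. The function names `backproject_mask` and `axis_aligned_box` are hypothetical and are not taken from the TAB codebase; this shows only the general idea of lifting masked RGB-D pixels into 3D, not the authors' pipeline.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift masked pixels with valid depth into world-frame 3D points.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy) and a
    4x4 camera-to-world matrix; both are illustrative inputs.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep or z <= 0:
                continue  # skip pixels outside the mask or without depth
            # Pinhole back-projection into the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Rigid transform into the shared world frame.
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(world)
    return points


def axis_aligned_box(points: Sequence[Point]) -> Tuple[Point, Point]:
    """Return (min_corner, max_corner) of an axis-aligned 3D box."""
    if not points:
        raise ValueError("no 3D points to bound")
    mins = tuple(min(p[i] for p in points) for i in range(3))
    maxs = tuple(max(p[i] for p in points) for i in range(3))
    return mins, maxs
```

Points returned for several frames can be concatenated before calling `axis_aligned_box`, which is one simple way to merge multi-view evidence into a single 3D box under these assumptions.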

Code & Models

Repositories

whb139426/TAB-Agent
github

Datasets

WHB139426/Scannet
dataset· 3.2k dl
