SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

Zhao Jin; Rong-Cheng Tu; Jingyi Liao; Wenhao Sun; Xiao Luo; Shunyu Liu; Dacheng Tao

arXiv:2506.21924·cs.CV·June 30, 2025

TL;DR

SPAZER is a zero-shot 3D visual grounding method that combines spatial and semantic reasoning using pre-trained vision-language models. Without requiring 3D training data, it improves accuracy over previous zero-shot methods by 9.0% on ScanRefer and 10.9% on Nr3D.

Contribution

Introduces SPAZER, a VLM-driven agent that integrates spatial and semantic reasoning in a progressive framework for zero-shot 3D visual grounding.

Findings

01

Outperforms previous zero-shot methods on ScanRefer and Nr3D datasets.

02

Achieves accuracy improvements of 9.0% on ScanRefer and 10.9% on Nr3D over previous zero-shot methods.

03

Effectively bridges spatial and semantic understanding in 3D grounding.

Abstract

3D Visual Grounding (3DVG) aims to localize target objects within a 3D scene based on natural language queries. To alleviate the reliance on costly 3D training data, recent studies have explored zero-shot 3DVG by leveraging the extensive knowledge and powerful reasoning capabilities of pre-trained LLMs and VLMs. However, existing paradigms tend to emphasize either spatial (3D-based) or semantic (2D-based) understanding, limiting their effectiveness in complex real-world applications. In this work, we introduce SPAZER - a VLM-driven agent that combines both modalities in a progressive reasoning framework. It first holistically analyzes the scene and produces a 3D rendering from the optimal viewpoint. Based on this, anchor-guided candidate screening is conducted to perform a coarse-level localization of potential objects. Furthermore, leveraging retrieved relevant 2D camera images, 3D-2D…
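The coarse-to-fine flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`SceneObject`, `ground`, `score_view`, `screen`, `score_candidate`, `demo`) is hypothetical, and the three callables are toy stand-ins for what would be VLM calls in the paper. Because the abstract is truncated, the final stage is only an assumption about how a fine-grained decision among the screened candidates might be made.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical record for one detected object in the 3D scene."""
    obj_id: int
    label: str
    center: tuple


def ground(
    query: str,
    objects: Sequence[SceneObject],
    views: Sequence[str],
    score_view: Callable[[str, str], float],
    screen: Callable[[SceneObject, str, str], bool],
    score_candidate: Callable[[SceneObject, str], float],
) -> Optional[SceneObject]:
    """Progressive grounding sketch: view selection, screening, final pick."""
    # Stage 1: choose the rendering viewpoint rated best for this query.
    best_view = max(views, key=lambda v: score_view(v, query))
    # Stage 2: coarse screening keeps only plausible candidates in that view.
    candidates = [o for o in objects if screen(o, best_view, query)]
    if not candidates:
        return None
    # Stage 3 (assumed): fine-grained decision among the surviving candidates.
    return max(candidates, key=lambda o: score_candidate(o, query))


def demo() -> Optional[SceneObject]:
    """Toy run with hand-written scorers in place of a VLM."""
    objects = [
        SceneObject(0, "chair", (0.0, 0.0, 0.0)),
        SceneObject(1, "chair", (3.0, 0.0, 0.0)),
        SceneObject(2, "door", (4.0, 0.0, 0.0)),
        SceneObject(3, "table", (1.0, 1.0, 0.0)),
    ]
    door = objects[2]
    return ground(
        query="the chair closest to the door",
        objects=objects,
        views=["front", "top"],
        # Toy rule: prefer the top-down view.
        score_view=lambda view, q: 1.0 if view == "top" else 0.0,
        # Toy rule: keep objects of the target class only.
        screen=lambda o, view, q: o.label == "chair",
        # Toy rule: closer to the anchor (door) scores higher.
        score_candidate=lambda o, q: -math.dist(o.center, door.center),
    )
```

Passing the scorers in as callables keeps the staged control flow separate from whatever model answers each question, so the same skeleton runs with toy rules, as in `demo()`, or with real model calls.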

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition