SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Xuefei Sun; Xujia Zhang; Brendan Crowe; Doncey Albin; Christoffer Heckman

arXiv:2605.21788·cs.CV·May 22, 2026

TL;DR

SceneGraphGrounder introduces a zero-shot 3D visual grounding method that uses structured scene graph matching and a visual marker prompting strategy to improve spatial reasoning and interpretability.

Contribution

SceneGraphGrounder reformulates 3D grounding as structured graph matching over a reconstructed scene graph, enabling view-independent reasoning and interpretability in zero-shot settings.

Findings

01 Achieves competitive zero-shot performance on the ScanRefer benchmark.

02 Demonstrates robust spatial reasoning in a real-world robot deployment.

03 Uses only RGB-D inputs for 3D scene understanding.

Abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning.…
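
To make the idea of aligning a query graph with a scene graph concrete, here is a minimal sketch in Python. It is not the authors' implementation: the data layout (dictionaries of object labels, triples for relations), the function name `match_query`, the relation names, and the toy scene are all assumptions chosen for illustration. It uses a plain backtracking search that keeps only assignments satisfying every label and relation constraint in the query.

```python
def match_query(scene_nodes, scene_edges, query_nodes, query_edges):
    """Return every assignment of query variables to scene object ids
    that satisfies all label and relation constraints (hypothetical helper)."""
    variables = list(query_nodes)
    # Candidate scene objects per query variable, filtered by exact label match.
    candidates = {
        v: [oid for oid, label in scene_nodes.items() if label == query_nodes[v]]
        for v in variables
    }
    edge_set = set(scene_edges)
    results = []

    def consistent(assignment):
        # Check only the query edges whose two endpoints are already assigned.
        for subj, rel, obj in query_edges:
            if subj in assignment and obj in assignment:
                if (assignment[subj], rel, assignment[obj]) not in edge_set:
                    return False
        return True

    def backtrack(i, assignment):
        if i == len(variables):
            results.append(dict(assignment))
            return
        v = variables[i]
        for oid in candidates[v]:
            if oid in assignment.values():
                continue  # each scene object is bound to at most one variable
            assignment[v] = oid
            if consistent(assignment):
                backtrack(i + 1, assignment)
            del assignment[v]

    backtrack(0, {})
    return results


# Toy scene graph (assumed): object id -> label, plus (subject, relation, object) triples.
scene_nodes = {0: "chair", 1: "chair", 2: "table", 3: "lamp"}
scene_edges = [(0, "next_to", 2), (3, "on", 2), (1, "next_to", 3)]

# Query graph for "the chair next to the table".
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = [("target", "next_to", "anchor")]

matches = match_query(scene_nodes, scene_edges, query_nodes, query_edges)
print(matches)
```

In this toy scene only chair 0 is next to the table, so the search returns a single binding for `target`. Because the result is an explicit variable-to-object assignment, each grounding decision can be traced back to the relations that justified it, which is the kind of interpretability the abstract refers to.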
