SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

TL;DR
SceneGraphGrounder introduces a zero-shot 3D visual grounding method that uses structured scene graph matching and a visual marker prompting strategy to improve spatial reasoning and interpretability.
Contribution
It reformulates 3D grounding as structured graph matching over a reconstructed scene graph, enabling view-independent reasoning and interpretability in zero-shot settings.
Findings
Achieves competitive zero-shot performance on the ScanRefer benchmark.
Demonstrates robust spatial reasoning in real-world robot deployment.
Uses only RGB-D inputs for 3D scene understanding.
Abstract
Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning.…
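To make the graph-matching formulation concrete, the following is a minimal sketch of aligning a query graph against a scene graph. It is not the authors' implementation: the toy scene, the relation names, the function `match_query`, and the exhaustive search are all assumptions made for illustration, standing in for the constrained alignment that the abstract mentions but does not specify.

```python
from itertools import permutations

# Hypothetical toy scene graph: object id -> category label.
scene_nodes = {0: "chair", 1: "chair", 2: "table", 3: "lamp"}
# Directed relations as (subject_id, relation, object_id) triples.
scene_edges = {(0, "next to", 2), (1, "next to", 3), (3, "on", 2)}

# Query graph for "the chair next to the table":
# named variables with label constraints, plus relation constraints.
query_nodes = {"target": "chair", "anchor": "table"}
query_edges = [("target", "next to", "anchor")]


def match_query(query_nodes, query_edges, scene_nodes, scene_edges):
    """Return every assignment of query variables to distinct scene
    objects that satisfies all label and relation constraints.

    Exhaustive search over assignments; fine for a toy scene only.
    """
    names = list(query_nodes)
    matches = []
    for combo in permutations(scene_nodes, len(names)):
        assign = dict(zip(names, combo))
        # Label constraint: each variable must map to the right category.
        if any(scene_nodes[assign[n]] != query_nodes[n] for n in names):
            continue
        # Relation constraint: each query edge must exist in the scene graph.
        if all((assign[s], r, assign[o]) in scene_edges
               for s, r, o in query_edges):
            matches.append(assign)
    return matches


matches = match_query(query_nodes, query_edges, scene_nodes, scene_edges)
grounded = [m["target"] for m in matches]
print(matches)   # [{'target': 0, 'anchor': 2}]
print(grounded)  # [0]
```

Both chairs satisfy the label constraint, but only chair 0 has a "next to" edge to the table, so the relation constraint disambiguates the target. The returned assignment also names the anchor object, which is what makes this style of reasoning inspectable.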
