View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

Yuanyuan Liu; Haiyang Mei; Dongyang Zhan; Jiayue Zhao; Dongsheng Zhou; Bo Dong; Xin Yang

arXiv:2512.09215 · cs.CV · December 11, 2025

TL;DR

This paper introduces View-on-Graph, a novel method for zero-shot 3D visual grounding that uses scene graphs to enable vision-language models to reason more effectively and interpretably about 3D scenes.

Contribution

It proposes a new paradigm and a scene graph-based method that improves zero-shot 3D visual grounding by enabling selective, step-by-step reasoning over 3D scenes.

Findings

01. Achieves state-of-the-art zero-shot performance in 3D visual grounding.

02. Structured scene exploration reduces reasoning difficulty for VLMs.

03. Provides transparent, interpretable reasoning traces.

Abstract

3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an…
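
To make the idea of incremental retrieval concrete, here is a minimal sketch of a scene graph that is queried one step at a time, with each query logged as a reasoning trace. It is not the authors' implementation: the `SceneGraph` class, the `ground` function, the relation name `near`, and the object labels are all hypothetical, the graph is single-layer and text-only, and a simple rule-based loop stands in for the VLM.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): object nodes plus labeled spatial edges."""
    labels: dict = field(default_factory=dict)  # object id -> class label
    edges: dict = field(default_factory=dict)   # object id -> [(relation, other id)]

    def add_object(self, obj_id, label):
        self.labels[obj_id] = label
        self.edges.setdefault(obj_id, [])

    def add_relation(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def find(self, label):
        """Return ids of all objects with the given label, in ascending order."""
        return sorted(i for i, name in self.labels.items() if name == label)

    def neighbors(self, obj_id, relation):
        """Return only the objects linked to obj_id by the given relation."""
        return [dst for rel, dst in self.edges[obj_id] if rel == relation]


def ground(graph, target_label, relation, anchor_label):
    """Resolve '<target> <relation> <anchor>' by querying the graph step by step.

    A rule-based stand-in for the VLM: each step fetches one small piece of
    the graph and records the query, so the trace shows how the answer arose.
    """
    trace = []
    candidates = graph.find(target_label)
    trace.append(f"find({target_label!r}) -> {candidates}")
    matches = []
    for cand in candidates:
        linked = graph.neighbors(cand, relation)
        trace.append(f"neighbors({cand}, {relation!r}) -> {linked}")
        if any(graph.labels[n] == anchor_label for n in linked):
            matches.append(cand)
    return matches, trace


# Made-up scene: two chairs, one near a table and one near a window.
scene = SceneGraph()
for obj_id, label in [(0, "chair"), (1, "chair"), (2, "table"), (3, "window")]:
    scene.add_object(obj_id, label)
scene.add_relation(0, "near", 2)
scene.add_relation(1, "near", 3)

matches, trace = ground(scene, "chair", "near", "table")
print(matches)
for step in trace:
    print(step)
```

Each call touches only the nodes and edges it asks for, which is the contrast with handing the model one composite rendering of the whole scene. The recorded trace corresponds, loosely, to the interpretable reasoning traces mentioned in the findings.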

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices