A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

Zhenyang Liu; Sixiao Zheng; Siyu Chen; Cairong Zhao; Longfei Liang; Xiangyang Xue; Yanwei Fu

arXiv:2507.06719·cs.CV·July 10, 2025

Open Access

TL;DR

This paper introduces SpatialReasoner, a neural framework that leverages LLM-driven spatial reasoning and visual properties to improve open-vocabulary 3D visual grounding, especially for spatial relation understanding.

Contribution

The work presents a novel neural representation framework with LLM-driven spatial reasoning and visual property integration for enhanced 3D visual grounding.

Findings

1. Outperforms baseline models in 3D visual grounding tasks.

2. Effectively captures spatial relations in language queries.

3. Seamlessly integrates with various neural representations.

Abstract

Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced…
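
To make the spatial-relation problem concrete, the following minimal Python sketch resolves the abstract's example query, "the book on the chair," against a toy scene of axis-aligned 3D boxes. It is an illustration only and not SpatialReasoner's method: the `Box` type, the `is_on` geometric test, the tolerance value, and the scene coordinates are all hypothetical, and the query is assumed to have already been split into a target ("book"), an anchor ("chair"), and a relation ("on"), a step the paper attributes to LLM-driven reasoning.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box:
    """Hypothetical axis-aligned 3D bounding box with a class label (z is up)."""
    label: str
    min_xyz: Vec3
    max_xyz: Vec3


def overlaps_xy(a: Box, b: Box) -> bool:
    """True if the horizontal footprints of the two boxes intersect."""
    return all(
        a.min_xyz[i] < b.max_xyz[i] and b.min_xyz[i] < a.max_xyz[i]
        for i in (0, 1)
    )


def is_on(target: Box, anchor: Box, tol: float = 0.05) -> bool:
    """Assumed test for 'on': footprints overlap and the target's bottom
    face lies within `tol` metres of the anchor's top face."""
    gap = abs(target.min_xyz[2] - anchor.max_xyz[2])
    return overlaps_xy(target, anchor) and gap <= tol


def ground(scene: List[Box], target_label: str, anchor_label: str) -> List[Box]:
    """Return every target-labelled box that rests on some anchor-labelled box."""
    anchors = [b for b in scene if b.label == anchor_label]
    return [
        t for t in scene
        if t.label == target_label and any(is_on(t, a) for a in anchors)
    ]


# Toy scene: two books, only one of which sits on the chair.
scene = [
    Box("chair", (0.0, 0.0, 0.0), (0.5, 0.5, 0.45)),
    Box("table", (1.8, -0.2, 0.0), (2.8, 0.8, 0.75)),
    Box("book", (0.1, 0.1, 0.45), (0.3, 0.3, 0.48)),   # on the chair
    Box("book", (2.0, 0.0, 0.75), (2.2, 0.2, 0.78)),   # on the table
]

matches = ground(scene, target_label="book", anchor_label="chair")
```

Matching on the label "book" alone returns both books; the relation to the anchor object is what disambiguates them, which is the kind of reasoning the abstract says existing language field methods handle poorly.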

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation

Methods: Contrastive Language-Image Pre-training