LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding

Feng Xiao; Hongbin Xu; Guocan Zhao; Wenxiong Kang

arXiv:2505.04058 · cs.CV · August 18, 2025

TL;DR

This paper introduces LSVG, a novel 3D visual grounding framework that uses language-guided scene graphs and 2D-assisted multi-modal encoding to improve object discrimination and relational understanding in complex scenes.

Contribution

The paper presents a new approach combining scene graphs, 2D semantics, and graph attention for enhanced 3D visual grounding, addressing limitations of previous target-centered methods.

Findings

1. Outperforms state-of-the-art methods on benchmark datasets.
2. Effectively distinguishes similar objects using scene graph structure.
3. Improves relational perception in 3D visual grounding tasks.

Abstract

3D visual grounding aims to localize the unique target described in natural language in 3D scenes. The significant gap between the 3D and language modalities makes it challenging to distinguish multiple similar objects through the described spatial relationships. Current methods attempt to achieve cross-modal understanding in complex scenes via a target-centered learning mechanism, ignoring the modeling of referred objects. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination to improve relational perception. The framework incorporates a dual-branch visual encoder that leverages pre-trained 2D semantics to enhance and supervise the multi-modal 3D encoding. Furthermore, we employ graph attention to promote relationship-oriented information fusion in cross-modal interaction. The learned object…
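
The abstract mentions graph attention over a scene graph, but this page does not give the formulation. As a rough illustration only, the sketch below applies scaled dot-product attention restricted to each node's graph neighbours, with softmax weights. This is a generic technique and not the authors' architecture; the function names, toy object features, and edges are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def graph_attention(features, edges):
    """One round of attention limited to graph neighbours (plus self).

    features: dict mapping node name -> feature vector (list of floats)
    edges:    dict mapping node name -> list of neighbour node names
    Returns a dict of updated feature vectors, one per node.
    """
    updated = {}
    for node, feat in features.items():
        # Attend over the node itself and its listed neighbours only.
        neighbours = [node] + [n for n in edges.get(node, []) if n != node]
        scale = math.sqrt(len(feat))
        weights = softmax([dot(feat, features[n]) / scale for n in neighbours])
        # New feature = attention-weighted average of neighbour features.
        updated[node] = [
            sum(w * features[n][i] for w, n in zip(weights, neighbours))
            for i in range(len(feat))
        ]
    return updated


# Hypothetical toy scene graph: three objects with 2-D features.
features = {"chair": [1.0, 0.0], "table": [0.0, 1.0], "lamp": [1.0, 1.0]}
edges = {"chair": ["table"], "table": ["chair", "lamp"], "lamp": ["table"]}
out = graph_attention(features, edges)
```

Restricting attention to graph edges means each object's updated feature mixes in information only from objects it is related to, which is the general idea behind relationship-oriented fusion; how LSVG builds its edges from language is not described on this page.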

Code & Models

Repositories

onmyoji-xiao/AS3D (PyTorch, official)

Taxonomy

Methods: Softmax · Attention Is All You Need