TL;DR
This paper introduces LSVG, a novel 3D visual grounding framework that uses language-guided scene graphs and 2D-assisted multi-modal encoding to improve object discrimination and relational understanding in complex scenes.
Contribution
The paper presents a new approach combining scene graphs, 2D semantics, and graph attention for enhanced 3D visual grounding, addressing limitations of previous target-centered methods.
Findings
Outperforms state-of-the-art methods on benchmark datasets.
Effectively distinguishes similar objects using scene graph structure.
Improves relational perception in 3D visual grounding tasks.
Abstract
3D visual grounding aims to localize the unique target in a 3D scene that a natural-language description refers to. The significant gap between the 3D and language modalities makes it challenging to distinguish multiple similar objects through the described spatial relationships. Current methods attempt to achieve cross-modal understanding in complex scenes via a target-centered learning mechanism, ignoring the modeling of referred objects. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination to improve relational perception. The framework incorporates a dual-branch visual encoder that leverages pre-trained 2D semantics to enhance and supervise the multi-modal 3D encoding. Furthermore, we employ graph attention to promote relationship-oriented information fusion in cross-modal interaction. The learned object…
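To make the graph-attention idea in the abstract more concrete, here is a minimal sketch of attention restricted to the edges of a scene graph. It is not the paper's implementation: it assumes a single head of scaled dot-product attention masked by an adjacency matrix, and the names `graph_attention`, `masked_softmax`, `node_feats`, `adjacency`, and the random projection matrices are all hypothetical stand-ins for whatever object features and language-guided edges the framework actually produces.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight wherever mask is False."""
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability. Every row has at
    # least one allowed entry because self-loops are added by the caller.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def graph_attention(node_feats, adjacency, w_q, w_k, w_v):
    """Single-head dot-product attention limited to graph neighbours.

    node_feats: (N, D) object features (hypothetical input).
    adjacency:  (N, N) array, nonzero at [i, j] if object i attends to j.
    w_q, w_k, w_v: (D, H) projection matrices (random here, learned in practice).
    """
    n = node_feats.shape[0]
    # Allow each node to attend to itself so isolated nodes stay well defined.
    mask = adjacency.astype(bool) | np.eye(n, dtype=bool)
    q = node_feats @ w_q
    k = node_feats @ w_k
    v = node_feats @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    attn = masked_softmax(scores, mask)
    return attn @ v, attn

# Toy scene: objects 0-1 and 1-2 are related, object 3 has no relations.
rng = np.random.default_rng(0)
node_feats = rng.normal(size=(4, 8))
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
fused, attn = graph_attention(node_feats, adjacency, w_q, w_k, w_v)
```

In this toy setup, each object's fused feature mixes information only from objects it shares an edge with, which is one simple way relationship-oriented fusion could be expressed. An object with no edges, such as object 3, attends only to itself.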
Taxonomy
Methods: Softmax · Attention Is All You Need
