SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

Feng Xiao; Hongbin Xu; Qiuxia Wu; Wenxiong Kang

arXiv:2403.08182 · cs.CV · March 14, 2024 · 1 citation

TL;DR

SeCG introduces a semantic-enhanced graph attention model that improves 3D visual grounding by better capturing complex referential relationships and reducing interference from redundant visual information, outperforming existing methods on the ReferIt3D and ScanRefer benchmarks.

Contribution

The paper proposes a novel cross-modal graph attention network with memory graph attention layers for enhanced 3D visual grounding.

Findings

01. Outperforms existing methods on ReferIt3D and ScanRefer datasets.

02. Significantly improves localization in multi-relation scenarios.

03. Enhances relation-oriented mapping between language and visual data.

Abstract

3D visual grounding aims to automatically locate the 3D region of a specified object given the corresponding textual description. Existing works fail to distinguish similar objects, especially when multiple referred objects are involved in the description. Experiments show that direct matching of the language and visual modalities has limited capacity to comprehend complex referential relationships in utterances. This is mainly due to the interference caused by redundant visual information in cross-modal alignment. To strengthen relation-oriented mapping between different modalities, we propose SeCG, a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis. More text-related feature expressions are obtained through the guidance of…
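
As a rough illustration of the idea in the abstract (letting the text guide how object features are aggregated over a graph), the following is a minimal sketch in plain Python. It is not the authors' memory graph attention layer, whose details are not given here. The function name `text_guided_graph_attention`, the additive scoring rule, and the toy features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_graph_attention(node_feats, neighbors, text_feat):
    """One round of attention-weighted aggregation over an object graph.

    node_feats: list of per-object feature vectors (hypothetical inputs).
    neighbors:  dict mapping node index -> list of neighbor indices.
    text_feat:  a single sentence-level feature vector (assumed to live in
                the same space as the object features).

    Assumed scoring rule (not from the paper): the score of neighbor j for
    node i is dot(h_i, h_j) + dot(text, h_j), so neighbors that match the
    text receive more attention.
    """
    out = []
    for i, h_i in enumerate(node_feats):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            # Isolated node: keep its own feature unchanged.
            out.append(list(h_i))
            continue
        scores = [
            dot(h_i, node_feats[j]) + dot(text_feat, node_feats[j])
            for j in nbrs
        ]
        weights = softmax(scores)
        # Weighted sum of neighbor features, one dimension at a time.
        agg = [
            sum(w * node_feats[j][d] for w, j in zip(weights, nbrs))
            for d in range(len(h_i))
        ]
        out.append(agg)
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    graph = {0: [0, 1], 1: [0, 1]}
    # A text feature aligned with the second dimension pulls attention
    # toward the second object.
    print(text_guided_graph_attention(feats, graph, [0.0, 10.0]))
```

In this toy setup, changing `text_feat` shifts which neighbor dominates the aggregated feature, which is the kind of text-dependent visual encoding the abstract contrasts with language-independent encoding.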

Code & Models

Repositories

onmyoji-xiao/3dvg_secg (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition