TL;DR
TransRefer3D introduces a Transformer-based model with entity- and relation-aware attention modules for fine-grained 3D visual grounding, significantly improving accuracy over previous methods.
Contribution
This work is the first to apply a Transformer architecture with specialized attention modules to fine-grained 3D visual grounding, with the goal of learning more discriminative features.
Findings
Outperforms existing methods by up to 10.6% on Nr3D and Sr3D datasets.
Achieves state-of-the-art results in fine-grained 3D visual grounding.
Demonstrates effectiveness of entity- and relation-aware attention modules.
Abstract
Recently proposed fine-grained 3D visual grounding is an essential and challenging task, whose goal is to identify the 3D object referred by a natural language sentence from other distractive objects of the same category. Existing works usually adopt dynamic graph networks to indirectly model the intra/inter-modal interactions, making the model difficult to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic contents. In this work, we exploit Transformer for its natural suitability on permutation-invariant 3D point clouds data and propose a TransRefer3D network to extract entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching.…
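The abstract's two central ideas, cross-modal matching between objects and words and the fit between attention and unordered 3D inputs, can be illustrated with a minimal sketch. It is not the paper's EA or RA module: it is plain scaled dot-product cross-attention with no learned projections, written in pure Python. The function names (`softmax`, `cross_attention`) and the toy feature vectors are assumptions made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(object_feats, word_feats):
    """Each object feature (query) attends over all word features (keys and values).

    Hypothetical simplification: queries, keys, and values are the raw
    features, whereas a real Transformer layer applies learned projections.
    """
    dim = len(word_feats[0])
    scale = math.sqrt(dim)
    attended = []
    for query in object_feats:
        # Similarity of this object to every word, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in word_feats]
        weights = softmax(scores)
        # Weighted sum of word features: the language context for this object.
        attended.append([sum(w * value[d] for w, value in zip(weights, word_feats))
                         for d in range(dim)])
    return attended


# Toy data (assumed): three candidate objects and two word embeddings.
objects = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0]]
words = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

attended = cross_attention(objects, words)

# Reordering the objects only reorders the outputs: no positional
# encoding is used, so the object set is treated as unordered.
permuted = cross_attention(objects[::-1], words)
assert permuted == attended[::-1]
```

Because each object's output depends only on that object and the full set of words, shuffling the objects shuffles the outputs in the same way. That property is what makes attention a natural fit for point-cloud objects, which have no canonical order.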
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Attentive Walk-Aggregating Graph Neural Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Adam · Label Smoothing · Residual Connection
