RefMask3D: Language-Guided Transformer for 3D Referring Segmentation
Shuting He, Henghui Ding

TL;DR
RefMask3D introduces a novel transformer-based model for 3D referring segmentation that effectively fuses vision and language features, achieving state-of-the-art results on multiple datasets.
Contribution
The paper proposes Geometry-Enhanced Group-Word Attention, Linguistic Primitives Construction, and an Object Cluster Module to improve multi-modal understanding in 3D segmentation.
Findings
Achieves new state-of-the-art performance on 3D referring segmentation datasets.
Outperforms previous methods by 3.16% mIoU on ScanRefer.
Fuses vision and language features effectively despite the sparse, irregular structure of point clouds.
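The mIoU figure above can be made concrete with a small sketch. It assumes mIoU here means the average, over referring expressions, of the intersection-over-union between the predicted and ground-truth point masks, which is the usual convention for this task; the summary does not define the metric. The function names `mask_iou` and `mean_iou` are illustrative and not taken from the paper's code.

```python
from typing import Sequence


def mask_iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """IoU between two binary per-point masks over the same point cloud."""
    if len(pred) != len(gt):
        raise ValueError("masks must cover the same points")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Sequence[bool]], gts: Sequence[Sequence[bool]]) -> float:
    """Average per-expression IoU (one predicted mask per expression)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need one ground-truth mask per prediction")
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)


# Two expressions over a 4-point scene (toy data, not from any dataset).
preds = [[True, True, False, False], [False, True, True, False]]
gts = [[True, False, False, False], [False, True, True, False]]
score = mean_iou(preds, gts)
```

Under this reading, a 3.16% gain means the average overlap between predicted and annotated masks rose by 3.16 percentage points.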
Abstract
3D referring segmentation is an emerging and challenging vision-language task that aims to segment the object described by a natural language expression in a point cloud scene. The key challenge behind this task is vision-language feature fusion and alignment. In this work, we propose RefMask3D to explore the comprehensive multi-modal feature interaction and understanding. First, we propose a Geometry-Enhanced Group-Word Attention to integrate language with geometrically coherent sub-clouds through cross-modal group-word attention, which effectively addresses the challenges posed by the sparse and irregular nature of point clouds. Then, we introduce a Linguistic Primitives Construction to produce semantic primitives representing distinct semantic attributes, which greatly enhance the vision-language understanding at the decoding stage. Furthermore, we introduce an Object Cluster Module…
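To illustrate the general idea of group-word attention, here is a minimal sketch of scaled dot-product cross-attention in which each point-group feature attends over word features. It is not the paper's Geometry-Enhanced Group-Word Attention: the learned projections, the geometric grouping, and any geometry-specific terms are omitted, and the function name `group_word_attention` and the residual fusion are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def group_word_attention(groups: List[Vector], words: List[Vector]) -> List[Vector]:
    """Fuse language into each group feature via cross-attention.

    groups: one feature vector per sub-cloud (queries).
    words:  one feature vector per word (keys and values).
    Assumes both share the same feature dimension.
    """
    scale = math.sqrt(len(words[0]))
    fused = []
    for group in groups:
        # Attention weights of this group over all words.
        weights = softmax([dot(group, word) / scale for word in words])
        # Weighted sum of word features, one value per feature dimension.
        attended = [
            sum(wt * word[d] for wt, word in zip(weights, words))
            for d in range(len(group))
        ]
        # Residual connection (an assumption): keep the visual feature,
        # add the language context.
        fused.append([g + a for g, a in zip(group, attended)])
    return fused


# Toy example: two sub-cloud features and two word features of dimension 2.
fused = group_word_attention([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])
```

Attending at the level of groups rather than individual points means each query summarizes a geometrically coherent region, which is one plausible reading of how the approach copes with sparse, irregular point clouds.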
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
