A Unified Framework for 3D Point Cloud Visual Grounding
Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong Jin, Donghao Luo, Yan Wang, Liujuan Cao, Rongrong Ji

TL;DR
This paper introduces a unified 3D point cloud visual grounding framework, 3DRefTR, that combines 3D referring expression comprehension and segmentation, achieving superior performance with minimal additional latency.
Contribution
It proposes a novel unified transformer-based framework that integrates 3DREC and 3DRES, utilizing a Superpoint Mask Branch for efficient computation and improved accuracy.
Findings
Outperforms the state-of-the-art 3DRES method by 12.43% mIoU on ScanRefer
Improves 3DREC accuracy by 0.6% at an IoU threshold of 0.25
Achieves this with only 6% additional latency
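The two metrics named in these findings can be illustrated with a minimal sketch. It is not the paper's evaluation code: the function names, the axis-aligned box format `(xmin, ymin, zmin, xmax, ymax, zmax)`, and the use of `>=` at the threshold are assumptions made for illustration only.

```python
from typing import Sequence

# Hypothetical box format: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def box_iou_3d(a: Box, b: Box) -> float:
    """IoU of two axis-aligned 3D boxes (the kind of overlap scored in 3DREC)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis, so no overlap at all
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def mask_iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """IoU of two per-point binary masks (the kind of overlap scored in 3DRES)."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def acc_at_iou(ious: Sequence[float], threshold: float = 0.25) -> float:
    """Fraction of samples whose IoU reaches the threshold.

    Assumption: the comparison is non-strict (>=); a benchmark may differ.
    """
    return sum(1 for iou in ious if iou >= threshold) / len(ious)


def mean_iou(ious: Sequence[float]) -> float:
    """mIoU: the plain average of per-sample IoU values."""
    return sum(ious) / len(ious)
```

For example, `acc_at_iou` applied to a list of per-sample box IoUs gives an accuracy-at-0.25 style number, and `mean_iou` applied to per-sample mask IoUs gives an mIoU style number.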
Abstract
Thanks to its precise spatial referencing, 3D point cloud visual grounding is essential for deep understanding and dynamic interaction in 3D environments, encompassing 3D Referring Expression Comprehension (3DREC) and Segmentation (3DRES). We argue that 3DREC and 3DRES should be unified in one framework, which is also a natural progression in the community. To explain, 3DREC helps 3DRES locate the referent, while 3DRES in turn facilitates 3DREC via more fine-grained language-visual alignment. To achieve this, this paper takes the first step toward integrating 3DREC and 3DRES into a unified framework, termed 3D Referring Transformer (3DRefTR). Its key idea is to build upon a mature 3DREC model and leverage the readily available query embeddings and visual tokens from that model to construct a dedicated mask branch. Specifically, we propose the Superpoint Mask Branch, which serves a dual purpose: i) By…
Taxonomy
Topics: 3D Shape Modeling and Analysis · 3D Surveying and Cultural Heritage · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection
