A Unified Framework for 3D Point Cloud Visual Grounding

Haojia Lin; Yongdong Luo; Xiawu Zheng; Lijiang Li; Fei Chao; Taisong Jin; Donghao Luo; Yan Wang; Liujuan Cao; Rongrong Ji

arXiv:2308.11887 · cs.CV · November 21, 2023 · 2 cites

TL;DR

This paper introduces a unified 3D point cloud visual grounding framework, 3DRefTR, that combines 3D referring expression comprehension and segmentation, achieving superior performance with minimal additional latency.

Contribution

It proposes a novel unified transformer-based framework that integrates 3DREC and 3DRES, utilizing a Superpoint Mask Branch for efficient computation and improved accuracy.

Findings

01

Outperforms the state-of-the-art 3DRES method by 12.43% mIoU on ScanRefer

02

Improves 3DREC accuracy by 0.6% at the 0.25 IoU threshold (Acc@0.25IoU)

03

Achieves this with only 6% additional latency over the underlying 3DREC model
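
The findings above are reported as mIoU (for segmentation) and accuracy at an IoU threshold (for comprehension). As a rough illustration of what these numbers measure, here is a minimal Python sketch. It assumes binary point masks given as lists of 0/1 and axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). The function names are hypothetical and are not taken from the 3DRefTR repository, and the inclusive threshold comparison is an assumption.

```python
def mask_iou(pred, gt):
    """IoU between two binary point masks (equal-length sequences of 0/1)."""
    if len(pred) != len(gt):
        raise ValueError("masks must cover the same points")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def box_iou_3d(a, b):
    """IoU between two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def mean_iou(ious):
    """mIoU: the average IoU over all evaluated samples."""
    return sum(ious) / len(ious)


def accuracy_at(ious, threshold=0.25):
    """Fraction of samples whose IoU reaches the threshold (inclusive here, by assumption)."""
    return sum(1 for iou in ious if iou >= threshold) / len(ious)
```

Under this reading, a segmentation result is scored by averaging `mask_iou` over all referring expressions, while a comprehension result counts a predicted box as correct when `box_iou_3d` against the ground-truth box reaches 0.25.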

Abstract

Thanks to its precise spatial referencing, 3D point cloud visual grounding is essential for deep understanding and dynamic interaction in 3D environments, encompassing 3D Referring Expression Comprehension (3DREC) and Segmentation (3DRES). We argue that 3DREC and 3DRES should be unified in one framework, which is also a natural progression in the community. To explain, 3DREC helps 3DRES locate the referent, while 3DRES in turn facilitates 3DREC via more fine-grained language-visual alignment. To achieve this, this paper takes the initial step to integrate 3DREC and 3DRES into a unified framework, termed 3D Referring Transformer (3DRefTR). Its key idea is to build upon a mature 3DREC model and leverage the readily available query embeddings and visual tokens from the 3DREC model to construct a dedicated mask branch. Specifically, we propose a Superpoint Mask Branch, which serves a dual purpose: i) By…
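
To make the idea of a mask branch built from query embeddings and visual features more concrete, here is a minimal Python sketch of one generic way such a branch could work: point features are averaged within each superpoint, a query embedding scores each superpoint by dot product, and the resulting superpoint mask is copied back to the points. This is an illustrative assumption, not the authors' implementation; the excerpt above is truncated before it describes the actual design, and every function name and the toy data are hypothetical.

```python
import math


def pool_superpoints(point_feats, superpoint_ids, num_superpoints):
    """Average per-point features within each superpoint (assumed pooling scheme)."""
    dim = len(point_feats[0])
    sums = [[0.0] * dim for _ in range(num_superpoints)]
    counts = [0] * num_superpoints
    for feat, sp in zip(point_feats, superpoint_ids):
        counts[sp] += 1
        for d in range(dim):
            sums[sp][d] += feat[d]
    # Empty superpoints keep a zero feature vector.
    return [[v / max(c, 1) for v in row] for row, c in zip(sums, counts)]


def superpoint_mask(query, sp_feats, threshold=0.5):
    """Score each superpoint against one query embedding and threshold the sigmoid."""
    mask = []
    for feat in sp_feats:
        logit = sum(q * f for q, f in zip(query, feat))
        logit = max(-60.0, min(60.0, logit))  # clamp to avoid overflow in exp
        prob = 1.0 / (1.0 + math.exp(-logit))
        mask.append(prob > threshold)
    return mask


def broadcast_to_points(sp_mask, superpoint_ids):
    """Give every point the decision of the superpoint it belongs to."""
    return [sp_mask[sp] for sp in superpoint_ids]


# Toy example: four points with 2-D features, grouped into two superpoints.
point_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, -2.0], [0.0, -4.0]]
superpoint_ids = [0, 0, 1, 1]
query = [1.0, 1.0]

sp_feats = pool_superpoints(point_feats, superpoint_ids, num_superpoints=2)
point_mask = broadcast_to_points(superpoint_mask(query, sp_feats), superpoint_ids)
```

In this sketch the mask is predicted over a handful of superpoints rather than over every point, and the per-point result is recovered by a simple index lookup, which is one plausible reason a superpoint-level branch can be cheap to evaluate.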

Code & Models

Repositories

leon1207/3dreftr
PyTorch · Official

Taxonomy

Topics: 3D Shape Modeling and Analysis · 3D Surveying and Cultural Heritage · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections · Absolute Position Encodings · Residual Connection