3D-DRES: Detailed 3D Referring Expression Segmentation
Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Liujuan Cao

TL;DR
This paper introduces 3D-DRES, a new task for detailed 3D referring expression segmentation, supported by a novel dataset and baseline model, improving fine-grained 3D vision-language understanding.
Contribution
It proposes the 3D-DRES task, creates the DetailRefer dataset with phrase-instance annotations, and develops the DetailBase model for dual-mode segmentation.
Findings
Models trained on DetailRefer excel at phrase-level segmentation.
Training on DetailRefer improves performance on traditional 3D-RES benchmarks.
The dataset enables richer compositional reasoning in 3D visual grounding.
Abstract
Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional and contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and…
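To make the phrase-instance annotation idea concrete, here is a minimal sketch of how one such record might be represented and turned into per-point masks. The schema is an assumption for illustration only: the class names, field names (`start`, `end`, `instance_ids`, `is_target`), and the example sentence are hypothetical and are not taken from the DetailRefer release or the DetailBase code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhraseAnnotation:
    """One noun phrase and the 3D instances it refers to (hypothetical schema)."""
    start: int                 # character offset where the phrase begins
    end: int                   # character offset one past the phrase's last character
    instance_ids: List[int]    # IDs of the 3D instances the phrase maps to
    is_target: bool = False    # True for the phrase naming the sentence-level target


@dataclass
class DetailedExpression:
    """A description plus its phrase-to-instance mapping (hypothetical schema)."""
    text: str
    phrases: List[PhraseAnnotation] = field(default_factory=list)

    def phrase_text(self, phrase: PhraseAnnotation) -> str:
        return self.text[phrase.start:phrase.end]


def instance_mask(point_instance_ids: List[int], wanted_ids: List[int]) -> List[int]:
    """Binary per-point mask: 1 where the point belongs to a wanted instance."""
    wanted = set(wanted_ids)
    return [1 if pid in wanted else 0 for pid in point_instance_ids]


def phrase_masks(expr: DetailedExpression,
                 point_instance_ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Phrase-level mode: one mask per annotated noun phrase."""
    return [(expr.phrase_text(p), instance_mask(point_instance_ids, p.instance_ids))
            for p in expr.phrases]


def sentence_mask(expr: DetailedExpression, point_instance_ids: List[int]) -> List[int]:
    """Sentence-level mode: a single mask covering only the target phrase's instances."""
    ids = [i for p in expr.phrases if p.is_target for i in p.instance_ids]
    return instance_mask(point_instance_ids, ids)


# Toy example: a five-point scene with a chair (instance 3) and a table (instance 7).
text = "the chair next to the table"
chair_start = text.index("the chair")
table_start = text.index("the table")
example = DetailedExpression(
    text=text,
    phrases=[
        PhraseAnnotation(chair_start, chair_start + len("the chair"), [3], is_target=True),
        PhraseAnnotation(table_start, table_start + len("the table"), [7]),
    ],
)
points = [3, 3, 7, 7, 0]  # instance ID of each point; 0 stands for background here

per_phrase = phrase_masks(example, points)
whole_sentence = sentence_mask(example, points)
```

In this sketch the sentence-level mask is the special case of the phrase-level output restricted to the target phrase, which is one simple way to read "dual-mode" segmentation; the actual model may combine the two modes differently.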
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Text Readability and Simplification
