Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li

TL;DR
This paper introduces 3D Phrase Aware Grounding (3DPAG), a fine-grained and interpretable extension of 3D visual grounding that explicitly associates language phrases with objects in 3D scenes, supported by a large annotated dataset.
Contribution
It proposes a new fine-grained grounding task, creates a large phrase-level annotation dataset, and develops methods that significantly improve 3D visual grounding accuracy.
Findings
Achieved accuracy improvements of up to 4.6% on benchmark datasets.
Built a dataset of about 227K phrase-level annotations covering 88K sentences from Nr3D, Sr3D and ScanRefer.
Enhanced 3D grounding performance through novel phrase-object alignment and pre-training.
Abstract
Recent progress in 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring fine-grained relationships between contexts and non-target ones. In this paper, we extend 3DVG to a more fine-grained and interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then conducting the reasoning according to contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations using a self-developed platform, from 88K sentences of widely used 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. By tapping on our datasets, we can extend previous 3DVG methods to the…
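As a rough illustration of what a phrase-level annotation could look like, the sketch below links each phrase of a description to object IDs in a scene and marks which phrase names the target. The format of the actual annotations is not described here, so every name in it (PhraseAnnotation, GroundedSentence, find_phrase, the scene ID, and the object IDs) is a hypothetical assumption, not the authors' schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhraseAnnotation:
    """One phrase of a sentence, linked to the scene objects it refers to.

    Hypothetical layout; the real dataset format may differ.
    """
    start: int              # character offset where the phrase begins (inclusive)
    end: int                # character offset where the phrase ends (exclusive)
    object_ids: List[int]   # made-up scene object IDs the phrase refers to
    is_target: bool = False  # True if this phrase names the target object


@dataclass
class GroundedSentence:
    """A scene description together with its phrase-level annotations."""
    scene_id: str
    text: str
    phrases: List[PhraseAnnotation]

    def phrase_text(self, phrase: PhraseAnnotation) -> str:
        # Recover the phrase string from its character span.
        return self.text[phrase.start:phrase.end]

    def target_ids(self) -> List[int]:
        # Collect object IDs from phrases marked as the target.
        return [oid for p in self.phrases if p.is_target for oid in p.object_ids]


def find_phrase(text: str, phrase: str, object_ids: List[int],
                is_target: bool = False) -> PhraseAnnotation:
    """Build an annotation from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return PhraseAnnotation(start, start + len(phrase), object_ids, is_target)


# Made-up example: one target phrase and two context phrases.
text = "the chair next to the table by the window"
example = GroundedSentence(
    scene_id="scene_0000",  # placeholder ID, not a real scene reference
    text=text,
    phrases=[
        find_phrase(text, "the chair", [3], is_target=True),
        find_phrase(text, "the table", [7]),
        find_phrase(text, "the window", [12]),
    ],
)
```

In this layout, sentence-level 3DVG would use only `example.target_ids()`, while a phrase-aware setting would also need the object IDs attached to the context phrases.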
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Attentive Walk-Aggregating Graph Neural Network
