Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Zhihao Yuan; Xu Yan; Zhuo Li; Xuhao Li; Yao Guo; Shuguang Cui; Zhen Li

arXiv:2207.01821·cs.CV·May 30, 2023·6 cites

Open Access

TL;DR

This paper introduces 3D Phrase Aware Grounding (3DPAG), a fine-grained and interpretable extension of 3D visual grounding that explicitly associates language phrases with objects in 3D scenes, supported by a large annotated dataset.

Contribution

It proposes a new fine-grained grounding task, creates a large phrase-level annotation dataset, and develops methods that significantly improve 3D visual grounding accuracy.

Findings

01. Achieved up to 4.6% accuracy improvements on benchmark datasets.

02. Developed a large dataset with 227K phrase-level annotations.

03. Enhanced 3D grounding performance through novel phrase-object alignment and pre-training.

Abstract

Recent progress in 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring fine-grained relationships between contexts and non-target ones. In this paper, we extend 3DVG to a more fine-grained and interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then conducting the reasoning according to contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations using a self-developed platform, from 88K sentences of widely used 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. By tapping on our datasets, we can extend previous 3DVG methods to the…
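To make the phrase-level idea concrete, the following is a minimal sketch of how a single phrase-annotated sentence might be represented and scored. It is not the authors' data format or evaluation code: the class names, the character-offset encoding of phrases, the object ids, and the `phrase_accuracy` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PhraseAnnotation:
    """Hypothetical record linking one text span to one scene object."""
    start: int      # character offset where the phrase begins (inclusive)
    end: int        # character offset where the phrase ends (exclusive)
    object_id: int  # id of the 3D object the phrase refers to (assumed)


@dataclass(frozen=True)
class GroundedSentence:
    """Hypothetical sentence-level record with its phrase annotations."""
    text: str
    target_id: int                          # object the whole sentence refers to
    phrases: Tuple[PhraseAnnotation, ...]   # target and non-target phrases

    def phrase_text(self, phrase: PhraseAnnotation) -> str:
        return self.text[phrase.start:phrase.end]


def phrase_accuracy(sentence: GroundedSentence,
                    predicted: Dict[Tuple[int, int], int]) -> float:
    """Fraction of annotated phrases whose predicted object id matches the label.

    `predicted` maps (start, end) spans to predicted object ids; spans that are
    missing from the mapping count as wrong.
    """
    if not sentence.phrases:
        return 0.0
    hits = sum(
        1 for p in sentence.phrases
        if predicted.get((p.start, p.end)) == p.object_id
    )
    return hits / len(sentence.phrases)


# Illustrative example with made-up object ids: "the chair" is the target,
# "the table" is a non-target context object.
example = GroundedSentence(
    text="the chair next to the table",
    target_id=3,
    phrases=(PhraseAnnotation(0, 9, 3), PhraseAnnotation(18, 27, 7)),
)

# A model that finds the chair but picks the wrong table gets half credit.
score = phrase_accuracy(example, {(0, 9): 3, (18, 27): 5})
```

In this sketch, sentence-level 3DVG would only check `target_id`, whereas a phrase-aware setup also checks every context phrase. Taking the counts stated in the abstract at face value, about 227K annotations over 88K sentences works out to roughly 2.6 annotated phrases per sentence.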

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Attentive Walk-Aggregating Graph Neural Network