Multi3DRefer: Grounding Text Description to Multiple 3D Objects

Yiming Zhang; ZeMing Gong; Angel X. Chang

arXiv:2309.05251·cs.CV·September 12, 2023

Open Access · 1 Repo · 2 Datasets · 1 Video

TL;DR

This paper introduces Multi3DRefer, a new task and dataset for localizing multiple objects in 3D scenes based on natural language, addressing the limitations of existing single-object grounding methods.

Contribution

The paper proposes a new multi-object 3D grounding task, generalizes the ScanRefer dataset to support it, and develops a CLIP-based baseline that outperforms previous methods.

Findings

01. The dataset contains 61,926 descriptions of 11,609 objects.

02. The new baseline outperforms previous state-of-the-art methods.

03. A novel evaluation metric for multi-object 3D grounding is introduced.

Abstract

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering…

Code & Models

Repositories

3dlg-hcvc/M3DRef-CLIP
PyTorch · Official

Datasets

Videos

Multi3DRefer: Grounding Text Description to Multiple 3D Objects · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Focus · Contrastive Language-Image Pre-training