Multi3DRefer: Grounding Text Description to Multiple 3D Objects
Yiming Zhang, ZeMing Gong, Angel X. Chang

TL;DR
This paper introduces Multi3DRefer, a new task and dataset for localizing multiple objects in 3D scenes based on natural language, addressing the limitations of existing single-object grounding methods.
Contribution
It proposes a new multi-object grounding task, extends the ScanRefer dataset, and develops a CLIP-based baseline that outperforms previous methods.
Findings
The dataset contains 61,926 descriptions of 11,609 objects, where each description references zero, one, or multiple target objects.
The new baseline outperforms previous state-of-the-art methods.
A novel evaluation metric for multi-object 3D grounding is introduced.
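This summary does not say how the new metric is defined, so the following is only a minimal sketch of one plausible set-based score for a grounding task where a description may refer to zero, one, or several objects. It assumes axis-aligned 3D boxes and computes an F1 score at an IoU threshold with greedy one-to-one matching. The names `iou_3d` and `f1_at_iou`, the box layout, the threshold, and the convention that an empty prediction for an empty target set scores 1.0 are all illustrative assumptions, not the paper's definition.

```python
from typing import List, Tuple

# Assumed box layout (hypothetical): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def f1_at_iou(preds: List[Box], gts: List[Box], thresh: float = 0.5) -> float:
    """F1 between predicted and target box sets at a given IoU threshold.

    Assumed conventions: predicting nothing when nothing is referenced
    scores 1.0; any mismatch involving an empty set scores 0.0.
    """
    if not preds and not gts:
        return 1.0
    if not preds or not gts:
        return 0.0

    # Greedy one-to-one matching: take candidate pairs in descending IoU order.
    pairs = sorted(
        ((iou_3d(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_preds, used_gts = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < thresh:
            break  # all remaining pairs are below the threshold
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
        true_pos += 1

    if true_pos == 0:
        return 0.0
    precision = true_pos / len(preds)
    recall = true_pos / len(gts)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    targets = [(0, 0, 0, 1, 1, 1), (2, 0, 0, 3, 1, 1)]
    predictions = [(0, 0, 0, 1, 1, 1)]  # finds one of the two referenced objects
    print(f1_at_iou(predictions, targets))
```

Greedy matching is used here only to keep the sketch dependency-free; an optimal assignment (for example via `scipy.optimize.linear_sum_assignment`) could be substituted where matching quality matters.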
Abstract
We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting, we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61,926 descriptions of 11,609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering…
Videos
Multi3DRefer: Grounding Text Description to Multiple 3D Objects · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
Methods: Focus · Contrastive Language-Image Pre-training
