TL;DR
This paper presents ScanRefer, a method for localizing objects in 3D RGB-D scans from free-form natural language descriptions, together with a large-scale dataset of the same name, so that a target object can be identified directly in a scanned 3D scene.
Contribution
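The paper introduces a new task (3D object localization in RGB-D scans from natural language descriptions), a model that learns a fused descriptor from 3D object proposals and encoded sentence embeddings, and a large-scale dataset for training and evaluation.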
Findings
ScanRefer localizes a target object in a scanned 3D scene by regressing its 3D bounding box from a natural language description.
The ScanRefer dataset contains 51,583 descriptions of 11,046 objects from 800 ScanNet scenes.
The fused descriptor correlates language expressions with 3D geometric features.
Abstract
We introduce the task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, learning a fused descriptor from 3D object proposals and encoded sentence embeddings. This fused descriptor correlates language expressions with geometric features, enabling regression of the 3D bounding box of a target object. We also introduce the ScanRefer dataset, containing 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D.
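The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_target_box`, the two-layer scoring head (`w1`, `b1`, `w2`), the feature sizes, and the box layout (center plus size) are all assumptions made for illustration. As a simplification, the sketch picks the highest-scoring proposal box rather than regressing a box.

```python
import numpy as np

def select_target_box(proposal_feats, proposal_boxes, sentence_emb, w1, b1, w2):
    """Fuse per-proposal geometric features with a sentence embedding
    and return the box of the best-matching proposal.

    proposal_feats: (K, Dp) one feature row per 3D object proposal
    proposal_boxes: (K, 6) assumed layout (cx, cy, cz, dx, dy, dz)
    sentence_emb:   (Dl,) encoded description of the target object
    w1, b1, w2:     weights of a hypothetical two-layer scoring head
    """
    k = proposal_feats.shape[0]
    # Copy the sentence embedding once per proposal and concatenate:
    # each row is a fused language + geometry descriptor.
    tiled = np.tile(sentence_emb, (k, 1))
    fused = np.concatenate([proposal_feats, tiled], axis=1)  # (K, Dp + Dl)
    # The nonlinear hidden layer lets the score depend on the
    # interaction between language and geometry, not just their sum.
    hidden = np.maximum(fused @ w1 + b1, 0.0)
    scores = hidden @ w2                                      # (K,)
    # Softmax over proposals gives a confidence per candidate.
    exp = np.exp(scores - scores.max())
    conf = exp / exp.sum()
    best = int(np.argmax(conf))
    return proposal_boxes[best], conf

# Usage with random (untrained) weights, only to show the shapes involved.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))    # 5 proposals, 8-dim geometric features
boxes = rng.normal(size=(5, 6))
sent = rng.normal(size=4)          # 4-dim sentence embedding
w1 = rng.normal(size=(12, 16))
b1 = np.zeros(16)
w2 = rng.normal(size=16)
box, conf = select_target_box(feats, boxes, sent, w1, b1, w2)
```

In a trained system the weights would be learned so that the proposal matching the description receives the highest confidence; here they are random, so the selected box carries no meaning.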
