ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes

Ahmed Abdelreheem; Kyle Olszewski; Hsin-Ying Lee; Peter Wonka; Panos Achlioptas

arXiv:2212.06250 · cs.CV · April 4, 2023 · 5 cites

Open Access

TL;DR

ScanEnts3D is a large-scale dataset that links natural language to 3D objects, significantly enhancing visio-linguistic models' performance and interpretability in 3D scene understanding tasks.

Contribution

The paper introduces ScanEnts3D, a new dataset with explicit object-phrase correspondences, and demonstrates its effectiveness in improving neural models for 3D language understanding and generation.
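
To make the idea of explicit object-phrase correspondences concrete, here is a minimal Python sketch of how one annotated sentence could be represented. This is an illustration only, not the dataset's actual schema or file format: the class names, field names, object ids, and the example sentence are all hypothetical, and the scene id is just a placeholder string.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhraseLink:
    """One phrase of a sentence tied to the 3D object instances it mentions."""
    start: int                    # character offset where the phrase begins (inclusive)
    end: int                      # character offset where the phrase ends (exclusive)
    object_ids: tuple[int, ...]   # hypothetical instance ids inside the scene


@dataclass
class AnnotatedSentence:
    """A referential sentence plus its phrase-to-object links (hypothetical layout)."""
    scene_id: str
    text: str
    target_id: int                # the object the sentence as a whole refers to
    links: list[PhraseLink] = field(default_factory=list)

    def link_phrase(self, phrase: str, object_ids: tuple[int, ...]) -> PhraseLink:
        # Locate the first occurrence of the phrase and record its character span.
        start = self.text.find(phrase)
        if start < 0:
            raise ValueError(f"phrase not found in sentence: {phrase!r}")
        link = PhraseLink(start, start + len(phrase), object_ids)
        self.links.append(link)
        return link

    def phrase(self, link: PhraseLink) -> str:
        # Recover the phrase text from its stored span.
        return self.text[link.start:link.end]

    def mentioned_objects(self) -> set[int]:
        # Every object instance referenced by any phrase in the sentence.
        return {oid for link in self.links for oid in link.object_ids}

    def context_objects(self) -> set[int]:
        # Mentioned objects other than the target itself.
        return self.mentioned_objects() - {self.target_id}


# Illustrative values only; none of them come from the dataset.
example = AnnotatedSentence(
    scene_id="scene0000_00",
    text="the chair next to the round table",
    target_id=3,
)
example.link_phrase("the chair", (3,))
example.link_phrase("the round table", (7,))
```

In this sketch the target (`3`) and the other mentioned object (`7`) are both grounded, which is the kind of per-phrase supervision the dataset is described as adding on top of target-only annotations.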

Findings

1. Improves the state of the art on the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively.

2. Enhances 3D neural speaker performance by 13.2 CIDEr points.

3. Supports better generalization and interpretability of visio-linguistic models.

Abstract

The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D data. In this paper, we curate a large-scale and complementary dataset extending both the aforementioned ones by associating all objects mentioned in a referential sentence to their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA in both the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we experiment with competitive baselines and recent methods for…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Hand Gesture Recognition Systems
