Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Runyu Ding; Jihan Yang; Chuhui Xue; Wenqing Zhang; Song Bai; Xiaojuan Qi

arXiv:2308.00353·cs.CV·August 2, 2023

Open Access

TL;DR

Lowis3D leverages pre-trained vision-language models and hierarchical caption associations to improve open-world 3D scene understanding, enabling recognition and localization of unseen object categories with significant performance gains.

Contribution

The paper introduces a novel framework that uses vision-language models and hierarchical point-caption associations for open-world 3D scene understanding, addressing the scarcity of 3D-text pairs.

Findings

01. Significant improvements in semantic segmentation accuracy.
02. Enhanced instance and panoptic segmentation performance.
03. Effective localization of novel 3D objects in open-world scenarios.

Abstract

Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained…
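To make the idea of associating 3D points with image captions concrete, here is a minimal sketch of a view-level association only. It assumes each image already has a caption (produced by some VL captioning model, not shown), a pinhole camera with known intrinsics, and a world-to-camera pose. The names `View`, `project`, and `associate_captions` are hypothetical, occlusion is ignored, and this is not the paper's implementation: the hierarchical associations mentioned in the TL;DR are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class View:
    """One captioned image of the scene (all fields are illustrative assumptions)."""
    caption: str                    # caption text, assumed to come from a VL model
    rotation: List[List[float]]     # 3x3 world-to-camera rotation
    translation: List[float]        # world-to-camera translation
    fx: float                       # focal lengths and principal point, in pixels
    fy: float
    cx: float
    cy: float
    width: int                      # image size in pixels
    height: int


def project(point: Point, view: View) -> Optional[Tuple[float, float]]:
    """Project a world-space point to pixel coordinates; None if behind the camera."""
    x, y, z = point
    R, t = view.rotation, view.translation
    xc = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    yc = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zc = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zc <= 0:
        return None
    return view.fx * xc / zc + view.cx, view.fy * yc / zc + view.cy


def associate_captions(points: List[Point], views: List[View]) -> List[Tuple[str, List[int]]]:
    """Pair each caption with the indices of points that fall inside its image frame."""
    pairs = []
    for view in views:
        visible = []
        for i, p in enumerate(points):
            uv = project(p, view)
            if uv is None:
                continue
            u, v = uv
            if 0 <= u < view.width and 0 <= v < view.height:
                visible.append(i)
        if visible:  # skip views that see none of the points
            pairs.append((view.caption, visible))
    return pairs
```

Each returned pair is a caption together with the subset of points it describes, which is the kind of point-text pair that could then supervise a 3D network with a language-alignment objective. A real pipeline would also need a depth or visibility check so that occluded points are not paired with a caption.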

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications