Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding
Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

TL;DR
Lowis3D leverages pre-trained vision-language models and hierarchical caption associations to improve open-world 3D scene understanding, enabling recognition and localization of unseen object categories with significant performance gains.
Contribution
The paper introduces a novel framework that uses vision-language models and hierarchical point-caption associations for open-world 3D scene understanding, addressing the scarcity of 3D-text pairs.
Findings
Significant improvements in semantic segmentation accuracy.
Enhanced instance and panoptic segmentation performance.
Effective localization of novel 3D objects in open-world scenarios.
Abstract
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained…
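The abstract states that captions generated for multi-view images make it possible to associate 3D shapes with text. It does not spell out the mechanics in the portion shown here, so the following is only a minimal sketch of one plausible way to form such associations: project points into each view with a pinhole camera model and pair the points that land inside the image with that view's caption. The function names, the `views` dictionary layout, and the intrinsics parameters are all hypothetical, and occlusion is ignored.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def points_in_view(
    points_cam: Sequence[Point],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[int]:
    """Return indices of points that project inside the image.

    Assumption: points are already expressed in the camera frame,
    with z pointing forward. Occlusion is not checked.
    """
    visible = []
    for i, (x, y, z) in enumerate(points_cam):
        if z <= 0:  # behind the camera
            continue
        # Pinhole projection to pixel coordinates.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            visible.append(i)
    return visible


def associate_view_captions(
    views: Sequence[Dict],
    fx: float, fy: float, cx: float, cy: float,
    width: int, height: int,
) -> List[Tuple[List[int], str]]:
    """Pair each view's caption with the point indices seen in that view.

    Each entry of `views` is a hypothetical dict with keys
    "points_cam" (scene points in that camera's frame) and "caption".
    Views that see no points produce no pair.
    """
    pairs = []
    for view in views:
        idx = points_in_view(view["points_cam"], fx, fy, cx, cy, width, height)
        if idx:
            pairs.append((idx, view["caption"]))
    return pairs


if __name__ == "__main__":
    demo_views = [
        {"points_cam": [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (10.0, 0.0, 1.0)],
         "caption": "a chair next to a desk"},
    ]
    print(associate_view_captions(demo_views, 100.0, 100.0, 50.0, 50.0, 100, 100))
```

In this toy input only the first point is paired with the caption: the second lies behind the camera and the third projects outside the image bounds. A real pipeline would also need depth or visibility checks so that occluded points are not paired with captions describing what is in front of them.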
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications
