Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance
Phuc D.A. Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, Khoi Nguyen

TL;DR
Open3DIS introduces a method that leverages 2D mask aggregation across frames to improve open-vocabulary 3D instance segmentation, especially for small and ambiguous objects, achieving state-of-the-art results.
Contribution
The paper presents a novel module that combines 2D mask aggregation with 3D proposals to enhance open-vocabulary 3D instance segmentation performance.
Findings
Significant performance improvements on ScanNet200, S3DIS, and Replica datasets.
Effective segmentation of small-scale and geometrically ambiguous objects.
Outperforms existing state-of-the-art methods in open-vocabulary 3D segmentation.
Abstract
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals addressing the above limitations. These are then combined with 3D…
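The key idea in the abstract, mapping 2D instance masks from several frames onto point cloud regions, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pinhole camera with known intrinsics `K` and a 4x4 world-to-camera matrix per frame, it skips occlusion (depth) checks, and it merges per-frame point sets with a greedy IoU rule chosen purely for illustration. The function names `lift_mask_to_points` and `merge_across_frames` and the `iou_thresh` parameter are hypothetical.

```python
import numpy as np


def lift_mask_to_points(points, mask, K, world_to_cam):
    """Mark the 3D points whose projection lands inside a 2D instance mask.

    points:       (N, 3) world-space coordinates
    mask:         (H, W) boolean 2D instance mask for one frame
    K:            (3, 3) pinhole intrinsics (assumed known)
    world_to_cam: (4, 4) extrinsics for the same frame (assumed known)
    Note: no occlusion test is performed in this sketch.
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]        # camera-space coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                           # discard points behind the camera
    uvw = (K @ cam.T).T
    safe_z = np.where(in_front, z, 1.0)           # avoid division by zero
    u = np.round(uvw[:, 0] / safe_z).astype(int)  # pixel column
    v = np.round(uvw[:, 1] / safe_z).astype(int)  # pixel row
    h, w = mask.shape
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    hit = np.zeros(n, dtype=bool)
    hit[in_image] = mask[v[in_image], u[in_image]]
    return hit


def merge_across_frames(point_masks, iou_thresh=0.5):
    """Greedily merge per-frame point sets that overlap strongly in 3D.

    The IoU-based rule is an illustrative assumption, not the paper's
    aggregation procedure.
    """
    merged = []
    for pm in point_masks:
        for i, existing in enumerate(merged):
            inter = np.logical_and(pm, existing).sum()
            union = np.logical_or(pm, existing).sum()
            if union > 0 and inter / union >= iou_thresh:
                merged[i] = existing | pm         # grow the existing proposal
                break
        else:
            merged.append(pm.copy())              # start a new proposal
    return merged
```

In use, each 2D mask from each frame would be lifted with `lift_mask_to_points`, and the resulting boolean arrays passed together to `merge_across_frames` to obtain one point-level proposal per object. A real pipeline would also need a depth-based visibility test so that points hidden behind other surfaces are not assigned to the mask.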
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Human Pose and Action Recognition
