OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation
Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, Joan Lasenby

TL;DR
OpenIns3D introduces a novel 3D open-vocabulary scene understanding framework that combines mask proposal, synthetic scene image generation, and category lookup, achieving state-of-the-art results across diverse 3D tasks.
Contribution
It presents a simple yet effective
Findings
Achieves state-of-the-art performance on 3D open-vocabulary recognition, object detection, and instance segmentation, on both indoor and outdoor datasets
Allows easy switching between different 2D detectors without retraining
Demonstrates strong reasoning with complex text queries when combined with LLMs
Abstract
In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-vocabulary scene understanding. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds, the "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract objects of interest, and the "Lookup" module searches through the outcomes of "Snap" to assign category names to the proposed masks. This approach, though simple, achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks, including recognition, object detection, and instance segmentation, on both indoor and outdoor datasets. Moreover, OpenIns3D facilitates effortless switching between different 2D detectors without requiring retraining. When integrated with powerful 2D open-world models, it…
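To make the "Lookup" idea in the abstract more concrete, here is a minimal sketch of one way a category name could be assigned to a 3D mask proposal. It is not the authors' implementation. It assumes the mask's points are already expressed in the camera frame of a single synthetic image, uses a simple pinhole projection, and matches the projected mask's bounding box to 2D detections by box IoU. All function names, the `intrinsics` tuple layout `(fx, fy, cx, cy)`, and the `min_iou` threshold are hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

Point3D = Tuple[float, float, float]      # (x, y, z) in the camera frame
Pixel = Tuple[float, float]               # (u, v) in pixels
Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels


def project_points(points: List[Point3D], fx: float, fy: float,
                   cx: float, cy: float) -> List[Pixel]:
    """Pinhole projection; points at or behind the camera (z <= 0) are dropped."""
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points if z > 0]


def bounding_box(pixels: List[Pixel]) -> Optional[Box]:
    """Axis-aligned box around the projected pixels, or None if there are none."""
    if not pixels:
        return None
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lookup_label(mask_points: List[Point3D],
                 detections: List[Tuple[str, Box]],
                 intrinsics: Tuple[float, float, float, float],
                 min_iou: float = 0.5) -> Optional[str]:
    """Return the label of the 2D detection that best overlaps the projected mask.

    Returns None when no detection reaches `min_iou` (an assumed threshold).
    """
    box = bounding_box(project_points(mask_points, *intrinsics))
    if box is None:
        return None
    best_label, best_score = None, min_iou
    for label, det_box in detections:
        score = iou(box, det_box)
        if score >= best_score:
            best_label, best_score = label, score
    return best_label


# Usage: a toy mask in front of the camera and two hypothetical 2D detections.
mask = [(-0.25, -0.25, 1.0), (0.25, 0.25, 1.0)]
dets = [("chair", (20.0, 20.0, 80.0, 80.0)), ("table", (90.0, 90.0, 120.0, 120.0))]
print(lookup_label(mask, dets, intrinsics=(100.0, 100.0, 50.0, 50.0)))
```

Because the 2D detector only enters through the `detections` list in this sketch, swapping detectors would not touch the 3D side, which is in the spirit of the abstract's claim about switching 2D detectors without retraining.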
Peer Reviews
Decision·Submitted to ICLR 2024
- The proposed framework focuses on an RGB-agnostic setting and achieves precise 3D instance segmentation through a multi-stage approach.
- The framework primarily relies on 3D proposals, establishing connections between 2D and 3D segments, and subsequently employs filtering operations that effectively leverage large-scale 2D vision models.
- Experimental results, when compared to those presented in previous papers, clearly illustrate a remarkable enhancement in performance.

- The Mask Proposal Module is trainable using IoU as a form of supervision. It is strongly recommended to include comprehensive training details in the main draft of the paper.
- Further clarification is needed regarding the adjustment of camera parameters in the Camera Intrinsic Calibration process.
- It is advisable to incorporate a comparative analysis that includes segmentation results obtained from multiple pseudo-projected images. Given that your method heavily relies on prior knowledge fr

- The writing of the paper is easy to follow.
- The paper tackles the interesting task of OV point cloud instance segmentation.
- The scores compared to some baselines look promising, even without the use of 2D images.

- At its core, the method is built on top of a somewhat flawed assumption. How can we obtain RGB point clouds without actually having aligned RGB images? Of course there might be LiDAR point clouds without aligned RGB images, but at that point we also cannot create synthetic RGB images from an uncolored point cloud to feed into a 2D model expecting RGB images. While I still see some potential benefit, like being able to render novel images that are more focused on certain objects or better su
1. The experimental results are impressive. OpenIns3D achieves superior quantitative results compared with other methods.
2. The idea is interesting. The authors propose a novel framework that can achieve 3D open-vocabulary scene understanding without 2D images.
3. This paper is well-written and maintains a smooth flow. The whole pipeline is easy to understand.

1. As the method consists of multiple steps, the authors should provide more training details for all steps in the main text or appendix.
2. Will the performance of the 2D open-world detector influence the performance of OpenIns3D? The authors do not seem to provide experimental results on this in the ablation study.
3. Although the authors propose 3D open-vocabulary scene understanding without 2D images, the method still needs well-prepared point clouds. It seems that 3D point clouds are also difficult to
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
