OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

Zhening Huang; Xiaoyang Wu; Xi Chen; Hengshuang Zhao; Lei Zhu; Joan Lasenby

arXiv:2309.00616·cs.CV·August 13, 2024·6 cites

TL;DR

OpenIns3D introduces a novel 3D open-vocabulary scene understanding framework that combines mask proposal, synthetic scene image generation, and category lookup, achieving state-of-the-art results across diverse 3D tasks.

Contribution

It presents a simple yet effective

Findings

01

Achieves state-of-the-art performance on 3D open-vocabulary tasks

02

Allows easy switching between different 2D detectors without retraining

03

Demonstrates strong reasoning with complex text queries when combined with LLMs

Abstract

In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-vocabulary scene understanding. The OpenIns3D framework employs a "Mask-Snap-Lookup" scheme. The "Mask" module learns class-agnostic mask proposals in 3D point clouds, the "Snap" module generates synthetic scene-level images at multiple scales and leverages 2D vision-language models to extract interesting objects, and the "Lookup" module searches through the outcomes of "Snap" to assign category names to the proposed masks. This approach, yet simple, achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks, including recognition, object detection, and instance segmentation, on both indoor and outdoor datasets. Moreover, OpenIns3D facilitates effortless switching between different 2D detectors without requiring retraining. When integrated with powerful 2D open-world models, it…
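The "Lookup" idea in the abstract, assigning category names to class-agnostic 3D masks from 2D detections on a rendered view, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single pinhole camera, axis-aligned 2D boxes, and a plain box-IoU match, and every name in it (`project_points`, `mask_bbox_2d`, `box_iou`, `lookup`, the `iou_thresh` parameter) is hypothetical. The detections are passed in as ordinary data, which loosely mirrors the claim that the 2D detector can be swapped without retraining.

```python
import numpy as np


def project_points(points, K, R, t):
    """Project Nx3 world points to pixel coordinates with a pinhole camera.

    Points at or behind the camera plane are dropped before the division.
    """
    cam = points @ R.T + t                 # world frame -> camera frame
    cam = cam[cam[:, 2] > 1e-6]            # keep points in front of the camera
    pix = cam @ K.T                        # apply intrinsics
    return pix[:, :2] / pix[:, 2:3]        # perspective divide


def mask_bbox_2d(points, K, R, t):
    """Axis-aligned 2D box [x0, y0, x1, y1] around a projected 3D mask."""
    uv = project_points(points, K, R, t)
    if len(uv) == 0:
        return None
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])


def box_iou(a, b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lookup(mask_points, detections, K, R, t, iou_thresh=0.5):
    """Give each 3D mask the label of its best-overlapping 2D detection.

    detections is a list of (box, label) pairs from any 2D detector.
    Masks with no detection above iou_thresh (an assumed cutoff) get None.
    """
    labels = []
    for pts in mask_points:
        box = mask_bbox_2d(pts, K, R, t)
        best_label, best_iou = None, iou_thresh
        if box is not None:
            for det_box, det_label in detections:
                iou = box_iou(box, det_box)
                if iou >= best_iou:
                    best_label, best_iou = det_label, iou
        labels.append(best_label)
    return labels


# Toy scene: two square masks two metres in front of an identity-pose camera.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
mask_a = np.array([[x, y, 2.0] for x in (-0.2, 0.2) for y in (-0.2, 0.2)])
mask_b = np.array([[x, y, 2.0] for x in (0.6, 0.8) for y in (-0.2, 0.2)])
detections = [
    (np.array([40.0, 40.0, 60.0, 60.0]), "chair"),
    (np.array([0.0, 0.0, 10.0, 10.0]), "table"),
]
labels = lookup([mask_a, mask_b], detections, K, R, t)
print(labels)
```

In this toy setup the first mask projects onto the "chair" box and the second overlaps no detection, so it stays unlabeled. A fuller pipeline would aggregate evidence over multiple rendered views and account for occlusion, neither of which this sketch attempts.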

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

- The proposed framework focuses on an RGB-agnostic setting and achieves precise 3D instance segmentation through a multi-stage approach.
- The framework primarily relies on 3D proposals, establishing connections between 2D and 3D segments, and subsequently employs filtering operations that effectively leverage large-scale 2D vision models.
- Experimental results, when compared to those presented in previous papers, clearly illustrate a remarkable enhancement in performance.

Weaknesses

- The Mask Proposal Module is trainable using IoU as a form of supervision. It is strongly recommended to include comprehensive training details in the main draft of the paper.
- Further clarification is needed regarding the adjustment of camera parameters in the Camera Intrinsic Calibration process.
- It is advisable to incorporate a comparative analysis that includes segmentation results obtained from multiple pseudo-projected images. Given that your method heavily relies on prior knowledge fr

Reviewer 02 · Rating 3: reject, not good enough · Confidence 4

Strengths

- The writing of the paper is easy to follow.
- The paper tackles the interesting task of OV point cloud instance segmentation.
- The scores compared to some baselines look promising, even without the use of 2D images.

Weaknesses

- At its core, the method is built on top of a somewhat flawed assumption. How can we obtain RGB point clouds, without actually having aligned RGB images? Of course there might be LiDAR point clouds without aligned RGB images, but at that point we can also not create synthetic RGB images from an uncolored point cloud, to feed into a 2D model expecting RGB images. While I still see some potential benefit, like being able to render novel images that are more focused on certain objects or better su

Reviewer 03 · Rating 6: marginally above the acceptance threshold · Confidence 3

Strengths

1. The experimental results are impressive. OpenIns3D achieves superior quantitative results compared with other methods.
2. The idea is interesting. The authors propose a novel framework that can achieve 3D open-vocabulary scene understanding without 2D images.
3. The paper is well written and maintains a smooth flow. The whole pipeline is easy to understand.

Weaknesses

1. As the method consists of multiple steps, the authors should provide more training details for all steps in the main text or appendix.
2. Will the performance of the 2D open-world detector influence the performance of OpenIns3D? The authors do not seem to provide experimental results on this in the ablation study.
3. Although the authors propose a 3D open-vocabulary scene understanding without 2D images, this method still needs well-prepared point clouds. It seems that 3D point clouds are also difficult to

Code & Models

Repositories

Pointcept/OpenIns3D
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications