Segment Any 3D Object with Language
Seungjun Lee, Yuyang Zhao, Gim Hee Lee

TL;DR
This paper introduces SOLE, a novel 3D segmentation framework that uses language instructions to generate semantic and geometric-aware masks directly from 3D point clouds, improving generalization to unseen categories.
Contribution
The paper presents a multimodal fusion network and multimodal supervision strategies for open-vocabulary 3D segmentation, achieving state-of-the-art results without class annotations.
Findings
Outperforms previous methods on ScanNet benchmarks
Close to fully-supervised performance without class annotations
Demonstrates versatility with language instructions
Abstract
In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely on only annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate poor generalizability to novel categories by generating class-agnostic masks or projecting generalized masks from 2D to 3D, but disregard semantic or geometry information, leading to sub-optimal performance. Instead, generating generalizable but semantic-related masks directly from 3D point clouds would result in superior outcomes. In this paper, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion…
Peer Reviews
Decision: ICLR 2025 Poster
1. **[Reasonable Design]** Unlike the previous paradigm, which separates "mask proposal" and "mask understanding", this framework combines these two processes. This integrated approach seems reasonable, as the inclusion of additional features from an image foundation model is likely to improve mask proposal quality and, consequently, overall performance.
2. **[Good Results]** The reasonable design leads to improved accuracy compared to previous literature. The paper validates its performance on
1. **[Potential Efficiency Issue]** A major concern relates to the original SOLE’s efficiency, particularly in terms of speed and memory consumption. Aggregating raw 2D images into a 3D point cloud is likely to be slow. Even if this process is considered a preprocessing step, the loading and processing of per-point CLIP features could also be extremely resource-intensive. This may pose a limitation for real-world applications.
2. **[Evaluation of Efficiency]** Building on the previous point, th
The problem studied is interesting and has broad application prospects, particularly in the interactive understanding of 3D scenes. The paper is well presented, with a clear motivation and sufficient detail in the methods, making it easy to follow. The proposed method offers a new perspective: directly predicting semantic-related masks from 3D point clouds using multimodal information.
What do the five parts in the Feature Backbone represent? Although not a contribution of this paper, it would be best to clearly illustrate this for better self-containment. The transition from class-agnostic to class-aware is achieved by introducing point-wise CLIP features. However, converting 3D point clouds to images for point-wise CLIP feature extraction and then projecting the features back to 3D seems to have significant computational overhead.
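To make the overhead concern concrete, the following is a minimal sketch of the back-projection step the review describes: each 3D point is projected into every view and the per-pixel features it lands on are averaged. It is not the authors' implementation. The function name `aggregate_pixel_features`, the pinhole camera model, and the nearest-pixel lookup are all assumptions made for illustration, and occlusion (depth) checks are omitted.

```python
import numpy as np

def aggregate_pixel_features(points, feature_maps, intrinsics, world_to_cam):
    """Average per-pixel features over the views in which each point is in frame.

    Hypothetical helper, assuming pinhole cameras with a [0, 0, 1] last row:
    points:       (N, 3) world coordinates
    feature_maps: list of (H, W, C) arrays (stand-ins for per-pixel CLIP features)
    intrinsics:   list of (3, 3) camera matrices
    world_to_cam: list of (4, 4) extrinsic matrices
    Returns (N, C) averaged features and an (N,) count of contributing views.
    """
    n = points.shape[0]
    c = feature_maps[0].shape[2]
    summed = np.zeros((n, c), dtype=np.float64)
    counts = np.zeros(n, dtype=np.int64)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)

    # One full pass over all N points per view: cost grows as views x points.
    for fmap, K, T in zip(feature_maps, intrinsics, world_to_cam):
        h, w, _ = fmap.shape
        cam = (T @ homog.T).T[:, :3]               # points in the camera frame
        in_front = cam[:, 2] > 1e-6                # discard points behind the camera
        z = np.where(in_front, cam[:, 2], 1.0)     # placeholder depth avoids division by zero
        uvw = (K @ cam.T).T
        u = np.round(uvw[:, 0] / z).astype(np.int64)
        v = np.round(uvw[:, 1] / z).astype(np.int64)
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        idx = np.nonzero(visible)[0]
        summed[idx] += fmap[v[idx], u[idx]]        # nearest-pixel lookup, no occlusion test
        counts[idx] += 1

    seen = counts > 0
    summed[seen] /= counts[seen, None]
    return summed, counts
```

Even in this simplified form, the loop touches every point once per view and the output holds one C-dimensional vector per point, which is the speed and memory cost the reviewers ask the authors to quantify.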
1. The proposed feature ensemble module is effective in fusing features from both the backbone and CLIP.
2. The proposed MEA effectively improves the model performance by introducing fine-grained visual-language association.
3. The proposed SOLE shows superior performance on ScanNetv2 and ScanNet200 compared to both OpenIns3D and OpenMask3D.
1. In Table 5, it is unclear whether the proposed CMD is effective at a voxel size of 4cm. It would be better to report results without CMD at a voxel size of 4cm and compare them with the results in the third row.
2. Is the proposed method sensitive to the hyperparameters $\lambda_{MMA}$, $\lambda_{dice}$, and $\lambda_{BCE}$? More discussion is required.
3. In Tables 1-4, it is unclear why the authors ignore the results of Open3DIS with both 2D and 3D supervision. More discussions are
Taxonomy
Topics: Robotics and Sensor-Based Localization
Methods: Balanced Selection · ALIGN
