Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Yuchen Zhou, Jiayuan Gu, Tung Yen Chiang, Fanbo Xiang, Hao Su

TL;DR
Point-SAM introduces a promptable 3D segmentation model for point clouds, leveraging knowledge distillation from 2D SAM to improve scalability, accuracy, and versatility in 3D segmentation tasks.
Contribution
The paper presents a novel transformer-based 3D segmentation model, Point-SAM, that extends 2D SAM to point clouds and employs knowledge distillation for enhanced performance.
Findings
Outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks.
Enables interactive 3D annotation and zero-shot instance proposal.
Demonstrates effective knowledge transfer from 2D to 3D segmentation.
Abstract
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model Point-SAM, focusing on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge from 2D SAM for Point-SAM training by introducing a data engine to generate part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and…
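The abstract describes generating 3D pseudo-labels from 2D SAM but does not spell out the mechanics here. As a rough illustration only, one common way to carry a 2D mask over to a point cloud is to project each point into the image with a pinhole camera and read the mask value at that pixel. The sketch below assumes points already expressed in the camera frame, hypothetical intrinsics `fx, fy, cx, cy`, and no occlusion handling; the function name `lift_mask_to_points` is made up for this example and is not the paper's data engine.

```python
import numpy as np

def lift_mask_to_points(points_cam, mask, fx, fy, cx, cy):
    """Label 3D points by the 2D mask pixel they project to.

    points_cam: (N, 3) array in the camera frame, z pointing forward.
    mask:       (H, W) boolean array, e.g. one mask predicted on a rendered view.
    Returns (labels, visible), both (N,) boolean arrays.
    Assumption: no occlusion test, so hidden points behind a surface
    would also receive the label of the pixel they project to.
    """
    h, w = mask.shape
    z = points_cam[:, 2]
    in_front = z > 0
    # Avoid division by zero for points at or behind the camera plane.
    safe_z = np.where(in_front, z, 1.0)
    u = np.round(fx * points_cam[:, 0] / safe_z + cx).astype(int)  # column
    v = np.round(fy * points_cam[:, 1] / safe_z + cy).astype(int)  # row
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.zeros(len(points_cam), dtype=bool)
    labels[visible] = mask[v[visible], u[visible]]
    return labels, visible
```

In practice a pipeline of this kind would usually add a depth test for occlusion and aggregate labels over many views; both are omitted here to keep the sketch short.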
Peer Reviews
Decision: ICLR 2025 Poster
+ The method is a step towards 3D foundation models: it eliminates the need for multiple 2D views of the 3D object and for 2D-3D lifting of SAM proposals at inference, while allowing the 3D proposals to be refined with additional prompts (as seen in supp.).
+ Unified training strategy on 3D point clouds across several datasets, covering different modalities at either the annotation or the scale level (part, object masks; single object, entire scenes).
+ Good performance on zero-sho
- Although the proposed Voronoi diagram tokenizer lowers the computational and memory cost of the overall pipeline, in many cases it fails to surpass the performance of the k-NN-based tokenizer.
- Regarding the OOD scenarios, and PartNet-Mobility in particular, the held-out categories (scissors, refrigerators, and doors) are all part of PartNet, so Point-SAM has seen them during training. This weakens the claim of zero-shot transferability.
- The process of generating ps
1. The paper extends the powerful 2D Segment Anything Model (SAM) to the 3D point cloud domain.
2. It introduces a novel data engine to generate multi-level pseudo-labels and augment the training data.
3. The interactive segmentation video is impressive and illustrates the method's potential for real-world application.
1. Lack of Visualization Results: The authors provide only a limited number of visualizations in both the main paper and the supplementary materials. The qualitative results presented in Figure 4 are inadequate. More complex examples, such as those shown in the supplementary videos, should be included. Moreover, several applications discussed in the paper, such as few-shot part segmentation and zero-shot object proposal generation, lack corresponding visualization results.
1. The paper is clearly written and logically organized, effectively outlining the challenge to be addressed and the three distinct perspectives of the research.
2. 3D segmentation is a crucial direction in embodied intelligence, as it enables machines to understand and interact with complex environments. Moreover, scaling up towards 3D foundation models is of significant value.
My main concern is the technical novelty of this paper; the overall contribution feels somewhat weak.
1. In terms of model design, the method introduces the Voronoi tokenizer, which is innovative. However, compared to KNN, there is no significant performance improvement; the gains are only noticeable when the number of prompt points is low.
2. Additionally, there are many works utilizing SAM for 3D pseudo-labeling, whether in autonomous driving or in indoor setti
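Several reviews contrast the Voronoi tokenizer with a k-NN-based one. The sketch below is a minimal NumPy illustration of the difference between the two grouping schemes, not the paper's implementation. It assumes patch centers are chosen by farthest point sampling, and the function names are hypothetical. In k-NN grouping each center gathers its k nearest points, so groups have a fixed size and may overlap; in Voronoi grouping each point is assigned to its single nearest center, so the groups partition the cloud.

```python
import numpy as np

def farthest_point_sampling(points, num_centers, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(num_centers - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Track each point's distance to its closest chosen center so far.
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def knn_groups(points, center_idx, k):
    """k-NN grouping: (C, k) indices of the k nearest points per center."""
    d = np.linalg.norm(points[center_idx][:, None, :] - points[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]

def voronoi_groups(points, center_idx):
    """Voronoi grouping: (N,) index of the nearest center for every point."""
    d = np.linalg.norm(points[:, None, :] - points[center_idx][None, :, :], axis=2)
    return np.argmin(d, axis=1)
```

With C centers, k-NN grouping stores C × k memberships regardless of the cloud size, whereas Voronoi grouping stores exactly one membership per point. This is one plausible reading of why a Voronoi scheme could be cheaper, though the reviews only state the cost reduction without giving the reason.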
Taxonomy
Topics: 3D Shape Modeling and Analysis · 3D Surveying and Cultural Heritage · Remote Sensing and LiDAR Applications
Methods: Segment Anything Model
