TL;DR
SAM-CP enhances the Segment Anything model by introducing composable prompts for versatile segmentation, enabling it to perform semantic, instance, and panoptic segmentation with state-of-the-art open-vocabulary results.
Contribution
The paper proposes a novel framework that integrates composable prompts with SAM, allowing for multi-grained semantic perception and improved segmentation performance.
Findings
Achieves state-of-the-art open-vocabulary segmentation performance.
Effectively performs semantic, instance, and panoptic segmentation.
Introduces a unified affinity-based framework for composable prompts.
Abstract
The Segment Anything model (SAM) has shown a generalized ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Specifically, given a set of classes (in texts) and a set of SAM patches, the Type-I prompt judges whether a SAM patch aligns with a text label, and the Type-II prompt judges whether two SAM patches with the same text label also belong to the same instance. To decrease the complexity in dealing with a large number of semantic classes and patches, we establish a unified framework that calculates the affinity between (semantic and instance) queries and SAM patches and merges patches with high affinity to the query. Experiments show that SAM-CP achieves semantic,…
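The composition step described in the abstract can be illustrated with a minimal sketch. It assumes the Type-I and Type-II judgments are already available as score matrices. The function name `compose_segments`, the thresholds `t1` and `t2`, and the array layouts are hypothetical and do not come from the paper. The paper's unified framework computes affinities between queries and patches; this sketch swaps that for a plain union-find merge over pairwise scores, only to show how the two prompt types combine into semantic and instance masks.

```python
import numpy as np

def compose_segments(patch_masks, type1_scores, type2_affinity, t1=0.5, t2=0.5):
    """Merge patches into semantic and instance masks (illustrative only).

    patch_masks:    (N, H, W) bool array, one mask per patch
    type1_scores:   (N, C) float array, patch-to-class alignment (Type-I)
    type2_affinity: (N, N) float array, same-instance score (Type-II)
    t1, t2:         hypothetical acceptance thresholds
    """
    n = patch_masks.shape[0]

    # Type-I: assign each patch its best class, or -1 if no class is confident.
    labels = type1_scores.argmax(axis=1)
    labels[type1_scores.max(axis=1) < t1] = -1

    # Union-find over patches, used to group patches into instances.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Type-II: link two patches only if they share a label and have high affinity.
    for i in range(n):
        for j in range(i + 1, n):
            same_label = labels[i] >= 0 and labels[i] == labels[j]
            if same_label and type2_affinity[i, j] >= t2:
                parent[find(i)] = find(j)

    semantic = {}   # class id -> union of all patches with that label
    grouped = {}    # (class id, group root) -> union of patches in that group
    for i in range(n):
        if labels[i] < 0:
            continue
        c = int(labels[i])
        empty = np.zeros_like(patch_masks[i])
        semantic[c] = semantic.get(c, empty) | patch_masks[i]
        key = (c, find(i))
        grouped[key] = grouped.get(key, empty) | patch_masks[i]

    instances = [(c, mask) for (c, _), mask in grouped.items()]
    return semantic, instances
```

Under these assumptions, the semantic output needs only the Type-I scores, while the instance output needs both; a panoptic result could be assembled by taking instance masks for countable classes and semantic masks for the rest.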
Peer Reviews
Decision: ICLR 2025 Poster
- The overall architecture designs are well-grounded. The two types of prompts for semantic and instance understanding are intuitively designed to support different segmentation tasks.
- SAM-CP models the two types of prompts with separated queries for better efficiency. The quantitative results on open-vocabulary segmentation also demonstrate the effectiveness of this query-based approach.
- Thorough analysis is provided for the advantages and limitations of the proposed method.
- The running time might be a potential issue due to the two-level prompt design (especially if an iterative approach is used). A comparison of running time performance with the original SAM and other segmentation models would be informative.
- Experiments: The current baselines do not include models based on SAM. Therefore, it's unclear whether the performance improvements come from the composable prompts or the internal capacity of SAM. Adding baselines building on SAM that perform instance [
1. The proposed model keeps the zero-shot abilities of CLIP. This is supported by experiments on open-vocabulary segmentation.
2. The ablation studies are sufficient and explain the design choices for the loss functions well.
3. The method is easy to follow, with many intuitive figures and visualizations.
1. The experiments on open-vocabulary segmentation (Table 1) do not support the claims well. SAM-CP combines SAM and CLIP but is only comparable to FC-CLIP, which is built with CLIP only. Also, the experiments on closed-vocabulary segmentation (Table 2) do not fully demonstrate the potential of combining SAM and CLIP, since SAM-CP is worse than the CLIP-only X-Decoder and the supervised SOTA MaskDINO.
2. The advantages of the proposed framework are not well discussed or presented. The performances are only comparable to
1. This work presents a method for improving SAM for versatile segmentation.
2. SAM-CP considers segmentation tasks from a novel perspective, introducing two types of composable prompts: Type-I for matching SAM patches with text labels and Type-II for merging patches belonging to the same instance.
3. The dynamic mechanism offers novel design insights for handling cross-attention computations.
4. Generally, the paper is well-written.
5. The figures and tables are informative.
1. Prompt I classifies each SAM-segmented patch, potentially leading to ambiguities. For example, without contextual information, it’s challenging to determine if a patch representing clothing on a person should be classified as part of the person or as an independent clothing item. The proposed CP lacks a clear strategy to handle the inherent ambiguity in segmentation without surrounding context.
2. The performance gains are not significant, such as mIoU for the COCO->ADE20K and COCO->Cityscap
Taxonomy
Methods: Sparse Evolutionary Training · Segment Anything Model
