Personalize Segment Anything Model with One Shot
Renrui Zhang, Zhengkai Jiang, Ziyu Guo, Shilin Yan, Junting Pan, Xianzheng Ma, Hao Dong, Peng Gao, Hongsheng Li

TL;DR
This paper introduces PerSAM, a training-free method to personalize the Segment Anything Model for specific visual concepts using only one image, with an optional quick fine-tuning step for improved accuracy, demonstrated on a new dataset and applications.
Contribution
The paper proposes PerSAM, a novel training-free personalization approach for SAM, and PerSAM-F, a one-shot fine-tuning method that requires minimal training time.
Findings
Effective personalization of SAM with a single image.
Competitive performance on video object segmentation.
Enhanced text-to-image generation with personalized models.
Abstract
Driven by large-data pre-training, the Segment Anything Model (SAM) has been demonstrated as a powerful and promptable framework, revolutionizing segmentation models. Despite its generality, customizing SAM for specific visual concepts without manual prompting is underexplored, e.g., automatically segmenting your pet dog in different images. In this paper, we propose a training-free Personalization approach for SAM, termed PerSAM. Given only a single image with a reference mask, PerSAM first localizes the target concept by a location prior, and segments it within other images or videos via three techniques: target-guided attention, target-semantic prompting, and cascaded post-refinement. In this way, we effectively adapt SAM for private use without any training. To further alleviate the mask ambiguity, we present an efficient one-shot fine-tuning variant, PerSAM-F. Freezing the…
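The abstract states only that PerSAM "localizes the target concept by a location prior". As a rough illustration of what such a prior could look like, the sketch below assumes it is obtained by averaging the reference image's features inside the reference mask and taking the cosine similarity of that vector with every location of a test feature map. The function names (`target_embedding`, `location_prior`) and the toy 2x2 feature grids are hypothetical; in practice the features would come from an image encoder, and this is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def target_embedding(ref_feats, ref_mask):
    """Average the reference features over the locations inside the mask."""
    selected = [
        ref_feats[i][j]
        for i, row in enumerate(ref_mask)
        for j, inside in enumerate(row)
        if inside
    ]
    if not selected:
        raise ValueError("reference mask is empty")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def location_prior(test_feats, target):
    """Return the similarity map and the (row, col) of its highest value."""
    sim = [[cosine(f, target) for f in row] for row in test_feats]
    positions = [(i, j) for i in range(len(sim)) for j in range(len(sim[0]))]
    best = max(positions, key=lambda p: sim[p[0]][p[1]])
    return sim, best


# Toy data (assumed, for illustration only): 2x2 grids of 2-D features.
ref_feats = [[[1.0, 0.0], [0.9, 0.1]],
             [[0.0, 1.0], [0.1, 0.9]]]
ref_mask = [[1, 1],
            [0, 0]]  # the target occupies the top row of the reference image
test_feats = [[[0.0, 1.0], [0.2, 0.8]],
              [[0.1, 0.9], [1.0, 0.1]]]

target = target_embedding(ref_feats, ref_mask)
sim_map, best_point = location_prior(test_feats, target)
print("most likely target location:", best_point)
```

Under these assumptions, the highest-similarity location could then be passed to a promptable segmenter as a point prompt, which is one plausible way a prior of this kind replaces manual prompting.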
Peer Reviews
Decision: ICLR 2024 poster
Single-shot image segmentation is an important problem, with many downstream uses in real-world applications ranging from design to healthcare. The paper introduces a simple but effective technique for it that leverages the powerful Segment Anything Model (SAM) [1]. The introduced method is called Personalization approach for SAM (PerSAM), and it takes as input a single example image of the desired object we want to segment, and its corresponding seg
Overall, it is a nicely written paper with good results. However, it is somewhat lacking in its quantitative evaluation. The choice of evaluation datasets is limited. It would be worthwhile to also see the performance of the proposed method for one-shot segmentation on additional (more challenging) datasets such as MS-COCO, ADE20K, and Cityscapes, and to compare with more powerful existing state-of-the-art models. The comparison is also lacking. It would be nice to compare against methods that do
1. The paper is well organised, with clear motivation, and is easy to understand. The illustration and visualisation figures are well presented. 2. PerSAM is training-free and computationally efficient, and the ablation experiments for PerSAM in Tables 4, 5 and 6 are extensive. 3. The paper demonstrates good performance not only on the constructed PerSeg benchmark, but also on many image/video segmentation benchmarks.
1. In the appendix, the authors mention using DINOv2 features. Can the authors also provide the results in Tables 2 and 3 using the default image encoder features of SAM? 2. What is the running speed and memory consumption of PerSAM compared to SAM? 3. In Table 2, can the authors provide a performance comparison to SAM-PT [a]? [a] is a related work on adapting SAM for video object segmentation. [a] SAM-PT: Extending SAM to zero-shot video segmentation with point-based tracking. arXiv, 2023. 4.
* This paper is well-written and easy to understand. * This paper first studies an interesting task of customizing a general-purpose segmentation model for personalized scenarios. And the paper presents a highly effective method to address this task. * The method is simple and easy to follow. The proposed PerSAM can guide SAM to segment target objects by three effective training-free techniques. By tuning 2 parameters within 10 seconds, PerSAM-F efficiently alleviates the mask ambiguity issue
The feature semantics of SAM might be limited due to SAM's class-agnostic training. While PerSAM and PerSAM-F demonstrate promising performance in personalized object segmentation, their effectiveness may be constrained by SAM's feature semantics in scenarios involving multiple different objects. This may require additional training to enable better transfer of SAM's features to downstream tasks. Alternatively, other representations with stronger semantics, such as CLIP, could be introduced.
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Segment Anything Model · Test · Diffusion
