SLiMe: Segment Like Me
Aliasghar Khani, Saeid Asgari Taghanaki, Aditya Sanghi, Ali Mahdavi Amiri, Ghassan Hamarneh

TL;DR
SLiMe is a novel method that leverages large vision-language models and attention maps to perform image segmentation at any granularity using only one annotated example, with improved performance in few-shot scenarios.
Contribution
SLiMe introduces a new approach that uses attention maps and text embedding optimization for one-shot and few-shot image segmentation, outperforming existing methods.
Findings
SLiMe effectively segments images with just one example.
Performance improves with additional training data.
Outperforms existing one-shot and few-shot segmentation methods.
Abstract
Significant strides have been made using large vision-language models, like Stable Diffusion (SD), for a variety of downstream tasks, including image editing, image correspondence, and 3D shape generation. Inspired by these advancements, we explore leveraging these extensive vision-language models for segmenting images at any desired granularity using as few as one annotated sample by proposing SLiMe. SLiMe frames this problem as an optimization task. Specifically, given a single training image and its segmentation mask, we first extract attention maps, including our novel "weighted accumulated self-attention map", from the SD prior. Then, using the extracted attention maps, the text embeddings of Stable Diffusion are optimized such that each of them learns about a single segmented region from the training image. These learned embeddings then highlight the segmented region in the…
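The optimization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not use Stable Diffusion: the per-pixel feature vectors, the helper names (attention, optimize_embeddings, segment), the cross-entropy loss, and all hyperparameters are assumptions chosen for illustration, and the weighted accumulated self-attention map is left out. The sketch keeps the image features fixed, learns one embedding per segmented region from a single annotated "image", and then reuses those embeddings to label a new one.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(features, embeddings):
    """Toy cross-attention: for each pixel, a distribution over region embeddings."""
    return [
        softmax([sum(f * e for f, e in zip(feat, emb)) for emb in embeddings])
        for feat in features
    ]


def optimize_embeddings(features, mask, num_classes, steps=200, lr=0.5, seed=0):
    """Learn one embedding per region so the attention maps match the mask.

    Only the embeddings are updated; the pixel features stay frozen,
    standing in (hypothetically) for a frozen pretrained model.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    embeddings = [[rng.gauss(0, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    for _ in range(steps):
        probs = attention(features, embeddings)
        grads = [[0.0] * dim for _ in range(num_classes)]
        for feat, p, label in zip(features, probs, mask):
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the logit of class c.
                err = p[c] - (1.0 if c == label else 0.0)
                for d in range(dim):
                    grads[c][d] += err * feat[d] / len(features)
        for c in range(num_classes):
            for d in range(dim):
                embeddings[c][d] -= lr * grads[c][d]
    return embeddings


def segment(features, embeddings):
    """Label each pixel with the region whose embedding attends to it most."""
    probs = attention(features, embeddings)
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# One annotated training "image": four pixels with made-up 2-D features.
train_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
train_mask = [0, 0, 1, 1]  # 0 = background, 1 = region of interest

learned = optimize_embeddings(train_features, train_mask, num_classes=2)

# An unseen "image" whose pixels have similar but not identical features.
test_features = [[0.2, 0.8], [0.8, 0.2], [0.95, 0.05]]
predicted = segment(test_features, learned)
print(predicted)
```

In this simplification the learned embeddings play the role the abstract assigns to the optimized text embeddings: each one ends up responding to a single region, so applying them to a new image yields a segmentation at the granularity of the training mask.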
Peer Reviews
Decision: ICLR 2024 poster
1: SLiMe introduces a unique one-shot optimization strategy for image segmentation, which is useful when the available data is limited.
2: The proposed method demonstrates superior performance over existing one- and few-shot segmentation methods in various tests, indicating its practical applicability and effectiveness.
3: The paper showcases the method's versatility by successfully applying it to different objects and granularity levels, emphasizing its broad applicability.
My main concerns focus on the **text prompt**. 1: The introduction of the text prompt is quite abrupt. In the Introduction, SLiMe is described as requiring only an image and a corresponding mask to achieve segmentation of any granularity. However, immediately after, the author talks about fine-tuning text embeddings. What is the definition of 'text' in this task? How are text embeddings obtained? And do different granularities correspond to the same text? The author is encouraged to provide fur
- A novel and interesting idea. The idea of retargeting the feature representation in a pretrained generative model for few-shot semantic segmentation is not new, but it is novel to exploit the text branch of a text-based image generation model (i.e. Stable Diffusion). Instead of training an adaptor model to map from the generative features to the semantic masks, which may overfit on input image features, the proposed method aims to find a text embedding that may be more generalizable. -
- Unconvincing importance of WAS attention: Figure 1 gives a good intuition for why WAS attention is needed to refine the object boundary. However, in Table 5, the contribution of WAS attention does not seem to be significant. Would the results without WAS attention (fig. 2a, 2c) improve much if GrabCut or another segmentation-refinement method were applied as post-processing?
- Lack of comparison on benchmark datasets like ADE-Bedroom-30 (used by segDDPM) and FSS-1000 (used in segGPT). The
1. The proposed idea of optimizing the text embedding based on the attention maps for semantic segmentation is interesting and novel.
2. The proposed method is shown to be advantageous in two datasets both quantitatively and qualitatively.
3. The paper is well-written and easy to follow. The theoretical background is explained well.
4. The limitations are discussed.
5. The method is well-ablated for the different components.
1. The proposed method seems to be adapted for the segmentation task based on the work by Hedlin et al. for unsupervised semantic correspondence.
2. Related works which are not cited:
[a] Burgert, Ryan, et al. "Peekaboo: Text to image diffusion models are zero-shot segmentors." arXiv preprint arXiv:2211.13224 (2022).
[b] Tian, Junjiao, et al. "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion." arXiv preprint arXiv:2308.12469 (2023).
3. The number of works
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
