SMITE: Segment Me In TimE
Amirhossein Alimohammadi, Sauradip Nag, Saeid Asgari Taghanaki, Andrea Tagliasacchi, Ghassan Hamarneh, Ali Mahdavi Amiri

TL;DR
This paper introduces SMITE, a video segmentation method that combines a pre-trained text-to-image diffusion model with a tracking mechanism to segment videos at arbitrary granularity from only one or a few annotated sample images.
Contribution
The paper presents a new approach integrating diffusion models with tracking for flexible, sample-efficient video segmentation, outperforming existing methods.
Findings
Effective segmentation across various scenarios
Outperforms state-of-the-art methods
Handles arbitrary granularity and limited samples
Abstract
Segmenting an object in a video presents significant challenges. Each pixel must be accurately labelled, and these labels must remain consistent across frames. The difficulty increases when the segmentation has arbitrary granularity, meaning the number of segments can vary arbitrarily, and masks are defined based on only one or a few sample images. In this paper, we address this issue by employing a pre-trained text-to-image diffusion model supplemented with an additional tracking mechanism. We demonstrate that our approach can effectively manage various segmentation scenarios and outperforms state-of-the-art alternatives.
Peer Reviews
Decision · ICLR 2025 Poster
Below are the strong points of this paper:
- It is well written and well organized, which helps the reader understand the new task and the motivation for the proposed method.
- The proposed techniques (learning generalizable segments through alternating tuning of text embeddings and cross-attention, temporal consistency via a tracking-based voting mechanism, and spatial consistency with a low-pass filter) work well together and outperform the other baselines by a large margin.
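To make the idea of temporal consistency via track-based voting more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: it votes on hard per-pixel labels rather than on attention maps, and the `tracks` input format and the function name `vote_along_tracks` are hypothetical.

```python
from collections import Counter


def vote_along_tracks(labels, tracks):
    """Return a copy of `labels` in which every tracked point takes the
    majority label observed along its track.

    labels: list of frames, each a 2D list of integer segment ids.
    tracks: list of tracks; each track is a list of (frame, row, col)
            positions of the same physical point over time
            (hypothetical output format of a point tracker).
    """
    # Copy so the per-frame predictions passed in are left untouched.
    smoothed = [[row[:] for row in frame] for frame in labels]
    for track in tracks:
        # Count how often each segment id appears along this track.
        votes = Counter(labels[t][r][c] for t, r, c in track)
        winner, _ = votes.most_common(1)[0]
        # Overwrite every position on the track with the winning id.
        for t, r, c in track:
            smoothed[t][r][c] = winner
    return smoothed


# Three 1x2 frames; one point moves from column 0 to column 1.
labels = [[[1, 0]], [[0, 1]], [[0, 2]]]
tracks = [[(0, 0, 0), (1, 0, 1), (2, 0, 1)]]
smoothed = vote_along_tracks(labels, tracks)
print(smoothed[2])  # the outlier label 2 in the last frame is replaced by 1
```

Voting over the whole track is the simplest choice; a sliding temporal window would be a natural variation when tracks are long or labels legitimately change.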
Below are the points the reviewer is concerned about:
- In line #247, the authors mention that learning text embeddings and cross-attention layers is conducted in two phases to provide a better initialization for the next phase. However, this phased or alternating training approach could also increase training complexity. Could the authors provide an ablation study comparing joint training with alternating training to justify the chosen training setup? Specifically, analys
1. Originality: The proposed method innovatively combines pre-trained text-to-image diffusion models with a temporal tracking mechanism to achieve video segmentation with flexible granularity using only a few reference images.
2. Quality: The comprehensive experiments, including quantitative evaluations on the new SMITE-50 dataset, qualitative comparisons, ablation studies, and user studies, provide solid support for the method's effectiveness.
3. Clarity: The paper is well-structured, with c
1. Methodological Clarity: While the overall explanation is detailed, some of the methodology is difficult to follow, especially for readers not familiar with diffusion models and the previous work SLiMe. The process of how tracking integrates into the attention map refinement could be clearer, with more intuitive descriptions of its working mechanism.
2. Dataset & Baselines: The new SMITE-50 dataset is an important contribution, but it appears somewhat small (with 50 videos). Expanding the data
1) The proposed method combines information from text, a tracker, and spatio-temporal features for video segmentation at arbitrary granularity.
2) SMITE achieves state-of-the-art performance on the SMITE-50 dataset.
1) The proposed method is compared with only two methods in Table 1, although many few-shot and zero-shot segmentation methods exist. In addition, the proposed method appears to be directly applicable to multi-target video segmentation datasets such as DAVIS2017, so the comparison should be more comprehensive in both methods and datasets.
2) Speed is important for online tracking, so some runtime analysis should be included.
3) Limitations of the method are not discussed in the experiments.
4) How do
Taxonomy
Topics: Medical Image Segmentation Techniques · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
