TL;DR
SeeDiff uses the attention mechanisms of Stable Diffusion to generate high-quality semantic segmentation masks. It extracts initial seeds from cross-attention and expands them through self-attention, without a pre-trained segmentation network, prompt tuning, or additional training.
Contribution
This work introduces SeeDiff, a novel method that exploits attention in diffusion models for off-the-shelf semantic mask generation without additional training.
Findings
Generates high-quality masks without training a segmentation network or tuning text prompts.
Utilizes cross-attention for initial object seeds.
Expands seeds using multi-scale self-attention for full object coverage.
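
The seed-then-expand idea in the findings above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`extract_seeds`, `expand_seeds`, `resize_nearest`), the relative-to-max seed threshold, the nearest-neighbour resizing, and the plain averaging across scales are all assumptions made for illustration. The attention maps are assumed to have already been extracted from the diffusion model as arrays: one cross-attention map for the class token, and one square self-attention matrix per scale.

```python
import numpy as np

def extract_seeds(cross_attn, ratio=0.5):
    """Boolean seed mask: pixels whose cross-attention to the class token
    is at least `ratio` times the map's maximum (assumed heuristic)."""
    return cross_attn >= ratio * cross_attn.max()

def resize_nearest(arr, size):
    """Nearest-neighbour resize of a square 2D array to (size, size)."""
    idx = np.arange(size) * arr.shape[0] // size
    return arr[np.ix_(idx, idx)]

def expand_seeds(seeds, self_attns, out_size, threshold=0.5):
    """Grow seeds using self-attention at several scales.

    self_attns: list of (n, n) matrices, n = side * side, where row i holds
    the attention that pixel i pays to every pixel at that scale.
    """
    acc = np.zeros((out_size, out_size))
    for attn in self_attns:
        side = int(round(np.sqrt(attn.shape[0])))
        # Bring the seed mask to this scale and flatten it.
        s = resize_nearest(seeds.astype(float), side).reshape(-1)
        # Score of each pixel: mean attention it receives from seed pixels.
        score = (s @ attn) / max(s.sum(), 1.0)
        score = score.reshape(side, side)
        score = score / (score.max() + 1e-8)  # normalise to [0, 1]
        acc += resize_nearest(score, out_size)
    # Average over scales (assumed fusion rule) and binarise.
    return (acc / len(self_attns)) >= threshold
```

Calling `expand_seeds(extract_seeds(cross_attn), [attn_16, attn_32], out_size=64)` with hypothetical 16x16 and 32x32 self-attention matrices would return a 64x64 boolean mask. The threshold values would need tuning on real attention maps.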
Abstract
Entrusted with the goal of pixel-level object classification, semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human effort, a few recent works have proposed generating pairs of images and annotation masks by exploiting the image-text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at the attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone…