Unsupervised Semantic Correspondence Using Stable Diffusion
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi

TL;DR
This paper demonstrates that pre-trained text-to-image diffusion models can be used in an unsupervised manner to find semantic correspondences across images by optimizing prompt embeddings, achieving competitive results without additional training.
Contribution
The authors introduce a novel unsupervised method leveraging diffusion models' semantic understanding to find image correspondences without training.
Findings
Achieves results on par with the strongly supervised state of the art on the PF-Willow dataset.
Outperforms existing weakly supervised and unsupervised methods on multiple datasets (20.9% relative on SPair-71k).
Uses optimized prompt embeddings to capture semantic regions.
Abstract
Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences - locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (20.9% relative for the SPair-71k dataset) any existing weakly…
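The optimize-then-transfer idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes that attention is a plain softmax over dot products between per-location image features and a single embedding vector, and it uses tiny hand-written feature lists in place of Stable Diffusion features. The names `attention`, `optimize_embedding`, and `find_correspondence`, and all numeric values, are hypothetical and chosen for illustration only.

```python
import math

def attention(features, embedding):
    """Toy attention: softmax over locations of feature-embedding dot products."""
    scores = [sum(f * e for f, e in zip(feat, embedding)) for feat in features]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def optimize_embedding(features, query_idx, steps=200, lr=0.5):
    """Gradient ascent on log attention at the query location (source image)."""
    dim = len(features[0])
    embedding = [0.0] * dim
    for _ in range(steps):
        probs = attention(features, embedding)
        for d in range(dim):
            # d/de log p[query] = f_query - sum_j p_j * f_j
            expected = sum(p * feat[d] for p, feat in zip(probs, features))
            embedding[d] += lr * (features[query_idx][d] - expected)
    return embedding

def find_correspondence(target_features, embedding):
    """Transfer step: return the target location with maximum attention."""
    probs = attention(target_features, embedding)
    return max(range(len(probs)), key=probs.__getitem__)

# Assumed toy data: four locations per image, three feature channels each.
source_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
# The target image holds similar (slightly perturbed) features in a different order.
target_features = [[0.0, 0.1, 0.9], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.4, 0.6, 0.1]]

query_idx = 1  # region of interest in the source image
embedding = optimize_embedding(source_features, query_idx)
match = find_correspondence(target_features, embedding)
print("matched target location:", match)
```

In this sketch the embedding is the only thing optimized, mirroring the abstract's point that no model weights are trained; the feature extractor is held fixed, and the same embedding is simply reused on the second image.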
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
