CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion
Cameron Braunstein, Mariya Toneva, and Eddy Ilg

TL;DR
This paper reveals that the semantic understanding in Stable Diffusion's image generation primarily originates from the CLIP model's representations, with the diffusion process acting mainly as a visual decoder.
Contribution
The study demonstrates, through probing, that the human-like semantic representations in Stable Diffusion come mainly from CLIP's text encoding rather than from the reverse diffusion process.
Findings
Semantic attributes are decoded more accurately from CLIP than from the diffusion process.
Attributes become harder to disambiguate during inverse diffusion.
CLIP's representations are key to the model's semantic understanding.
Abstract
Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to…
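The probing setup described in the abstract (a simple regression layer that predicts semantic attributes from an internal representation, scored against human annotations) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the representations and ratings are synthetic random data, the probe is a closed-form ridge regression, and the names `fit_ridge_probe`, `predict`, and `r2_per_attribute` are hypothetical. It is not the authors' code, data, or evaluation metric.

```python
import numpy as np

def fit_ridge_probe(features, targets, alpha=1.0):
    """Fit a linear probe minimising ||XW - Y||^2 + alpha * ||W||^2 (bias unpenalised)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # append bias column
    reg = alpha * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unregularised
    return np.linalg.solve(X.T @ X + reg, X.T @ targets)

def predict(weights, features):
    """Apply a fitted probe to new representations."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights

def r2_per_attribute(y_true, y_pred):
    """Coefficient of determination, computed separately for each attribute."""
    ss_res = ((y_true - y_pred) ** 2).sum(axis=0)
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumptions): one representation vector per object,
# and one "human rating" per object and attribute.
rng = np.random.default_rng(0)
n_objects, dim, n_attributes = 200, 32, 5
reps = rng.normal(size=(n_objects, dim))
true_map = rng.normal(size=(dim, n_attributes))
ratings = reps @ true_map + 0.1 * rng.normal(size=(n_objects, n_attributes))

# Fit on a training split and score on held-out objects.
train, test = slice(0, 150), slice(150, 200)
W = fit_ridge_probe(reps[train], ratings[train], alpha=1.0)
scores = r2_per_attribute(ratings[test], predict(W, reps[test]))
print(scores)
```

To compare sources of semantic information in this style, one would fit the same probe on representations taken from different places (for example, a text-encoder output versus an intermediate diffusion state) and compare the held-out scores per attribute.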
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
