What happens to diffusion model likelihood when your model is conditional?
Mattias Cross, Anton Ragni

TL;DR
This paper investigates how likelihoods derived from diffusion models behave in conditional settings such as text-to-image (TTI) and text-to-speech (TTS) synthesis, and finds that they are less sensitive to conditioning inputs than previously assumed.
Contribution
It uncovers the properties and limitations of diffusion model likelihoods in conditional tasks, highlighting their insensitivity to conditioning inputs.
Findings
TTS diffusion likelihoods are agnostic to text input.
TTI likelihoods are more expressive but cannot detect confounding prompts.
Conditional diffusion likelihoods are less sensitive than expected.
Abstract
Diffusion Models (DMs) iteratively denoise random samples to produce high-quality data. The iterative sampling process is derived from Stochastic Differential Equations (SDEs), allowing a speed-quality trade-off chosen at inference. Another advantage of sampling with differential equations is exact likelihood computation. These likelihoods have been used to rank unconditional DMs and for out-of-domain classification. Despite the many existing and possible uses of DM likelihoods, the distinct properties captured are unknown, especially in conditional contexts such as Text-To-Image (TTI) or Text-To-Speech synthesis (TTS). Surprisingly, we find that TTS DM likelihoods are agnostic to the text input. TTI likelihood is more expressive but cannot discern confounding prompts. Our results show that applying DMs to conditional tasks reveals inconsistencies and strengthens claims that the…
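The abstract's point that sampling with differential equations permits exact likelihood computation can be illustrated with a toy case. The sketch below is not the paper's method or models. It assumes a 1D Gaussian data distribution N(mu, s0^2), a simple forward SDE dx = dw (zero drift, unit diffusion), and an analytic score in place of a trained network. The names `score`, `drift`, `divergence`, and `log_likelihood` are hypothetical, and `mu` is only a stand-in for a conditioning input. The log-likelihood is obtained by integrating the probability-flow ODE forward in time and adding the accumulated divergence to the log-density at the terminal time.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Closed-form log-density of N(mean, var), used as a reference."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def score(x, t, mu, s0):
    """Analytic score of p_t = N(mu, s0^2 + t); stands in for a trained network."""
    return -(x - mu) / (s0 ** 2 + t)

def drift(x, t, mu, s0):
    """Probability-flow ODE drift for dx = dw (f = 0, g = 1): -0.5 * g^2 * score."""
    return -0.5 * score(x, t, mu, s0)

def divergence(t, s0):
    """Derivative of the drift with respect to x (closed form in this toy case)."""
    return 0.5 / (s0 ** 2 + t)

def log_likelihood(x0, mu=0.0, s0=1.0, T=10.0, steps=1000):
    """log p_0(x0) = log p_T(x_T) + integral over [0, T] of the drift divergence."""
    h = T / steps
    x, logdet, t = x0, 0.0, 0.0
    for _ in range(steps):
        # Fourth-order Runge-Kutta step for the state x.
        k1 = drift(x, t, mu, s0)
        k2 = drift(x + 0.5 * h * k1, t + 0.5 * h, mu, s0)
        k3 = drift(x + 0.5 * h * k2, t + 0.5 * h, mu, s0)
        k4 = drift(x + h * k3, t + h, mu, s0)
        # Simpson's rule for the divergence, which depends only on t here.
        logdet += h / 6 * (divergence(t, s0)
                           + 4 * divergence(t + 0.5 * h, s0)
                           + divergence(t + h, s0))
        x += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    # Assumption: the exact terminal marginal N(mu, s0^2 + T) is used as the prior.
    return gaussian_logpdf(x, mu, s0 ** 2 + T) + logdet

if __name__ == "__main__":
    ode_value = log_likelihood(0.7, mu=0.2, s0=0.5)
    exact_value = gaussian_logpdf(0.7, 0.2, 0.5 ** 2)
    print(ode_value, exact_value)
```

In this toy setting the ODE-based value can be checked against the closed-form Gaussian log-density. With a learned score network the divergence is not available in closed form and would have to be computed or estimated numerically, so the sketch says nothing about how real TTI or TTS models respond to their conditioning.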
Taxonomy
Topics: Advanced Neuroimaging Techniques and Applications
Methods: Diffusion
