Aligning Diffusion Models by Optimizing Human Utility
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato,, Kazuki Kozuka

TL;DR
Diffusion-KTO is a new method for aligning text-to-image diffusion models by maximizing expected human utility using simple binary feedback, improving alignment without complex reward models.
Contribution
It introduces a straightforward alignment approach that leverages binary feedback signals, avoiding the need for costly preference data or reward model training.
Findings
Outperforms existing methods in human judgment and automatic metrics.
Requires only simple binary feedback signals for alignment.
Enhances the applicability of aligning diffusion models with human preferences.
Abstract
We present Diffusion-KTO, a novel approach for aligning text-to-image diffusion models by formulating the alignment objective as the maximization of expected human utility. Since this objective applies to each generation independently, Diffusion-KTO does not require collecting costly pairwise preference data nor training a complex reward model. Instead, our objective requires simple per-image binary feedback signals, e.g. likes or dislikes, which are abundantly available. After fine-tuning using Diffusion-KTO, text-to-image diffusion models exhibit superior performance compared to existing techniques, including supervised fine-tuning and Diffusion-DPO, both in terms of human judgment and automatic evaluation metrics such as PickScore and ImageReward. Overall, Diffusion-KTO unlocks the potential of leveraging readily available per-image binary signals and broadens the applicability of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques
MethodsDiffusion
