Aligning Diffusion Models with Noise-Conditioned Perception
Alexander Gambashidze, Anton Kulikov, Yuriy Sosnin, Ilya Makarov

TL;DR
This paper introduces a perceptual objective in the U-Net embedding space for diffusion models, improving human preference alignment, training efficiency, and visual quality compared to traditional pixel or VAE space optimization.
Contribution
It proposes a novel perceptual optimization approach in the U-Net embedding space for diffusion models, enhancing preference alignment and reducing training costs.
Findings
Outperforms standard latent-space methods in quality and efficiency
Achieves over 60% on both general preference and visual appeal when fine-tuning SDXL
Reduces computational cost significantly during training
Abstract
Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For…
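As a rough illustration of the idea in the abstract, the sketch below computes a DPO-style preference loss in which the denoising errors are measured after passing predictions and targets through an embedding map, rather than directly in latent space. It is a minimal sketch under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical stand-in for the U-Net encoder features, the vectors are plain Python lists instead of latent tensors, and the loss form follows the commonly used Diffusion-DPO objective with a single scale `beta`.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_embed(v):
    """Hypothetical stand-in for U-Net encoder features: a fixed nonlinear map."""
    n = len(v)
    return [math.tanh(v[i] + 0.5 * v[(i + 1) % n]) for i in range(n)]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def embedding_dpo_loss(target_w, target_l, pol_w, pol_l, ref_w, ref_l,
                       beta=1.0, embed=toy_embed):
    """DPO-style loss with errors measured in embedding space (assumed form).

    target_*: denoising targets for the preferred (w) and rejected (l) samples
    pol_*:    predictions of the model being fine-tuned
    ref_*:    predictions of the frozen reference model
    """
    def err(pred, tgt):
        return sq_dist(embed(pred), embed(tgt))

    # How much the policy improves on the reference, per sample (negative = better).
    diff_w = err(pol_w, target_w) - err(ref_w, target_w)
    diff_l = err(pol_l, target_l) - err(ref_l, target_l)

    # -log(sigmoid(z)) equals softplus(-z), with z = -beta * (diff_w - diff_l).
    return softplus(beta * (diff_w - diff_l))


# Toy usage: the policy matches the preferred target exactly and ties the
# reference on the rejected sample, so the loss falls below log(2).
tw, tl = [0.1, -0.2, 0.3], [0.5, 0.5, -0.1]
rw, rl = [0.4, 0.0, 0.1], [0.2, 0.6, 0.0]
loss_improved = embedding_dpo_loss(tw, tl, tw, rl, rw, rl)
loss_neutral = embedding_dpo_loss(tw, tl, rw, rl, rw, rl)
```

When the policy and the reference make identical predictions, the loss equals log 2; it drops below that value as the policy's embedding-space error on the preferred sample shrinks relative to the reference. Passing `embed=lambda v: v` recovers the same loss with errors measured directly in the input space, which corresponds to the latent-space baseline the abstract contrasts against.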
Taxonomy
Topics: Neural Networks and Applications
Methods: Concatenated Skip Connection · Convolution · Max Pooling · ALIGN · U-Net · Diffusion
