SEE-DPO: Self Entropy Enhanced Direct Preference Optimization
Shivanshu Shekhar, Shreyas Singh, Tong Zhang

TL;DR
This paper introduces a self-entropy regularization mechanism to improve the stability and robustness of DPO-based training for diffusion models, reducing overfitting and reward hacking, and enhancing image quality and diversity.
Contribution
The paper proposes a novel self-entropy regularization method that stabilizes DPO training for diffusion models, addressing overfitting and reward hacking issues.
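As a rough illustration of how an entropy term can enter a DPO-style loss, here is a minimal sketch. It is an assumption, not the paper's stated formulation: it supposes the objective is KL-regularized reward maximization with an added entropy bonus weighted by a hypothetical coefficient `gamma`. Under that assumption the policy log-probability margin is scaled by `beta + gamma`, the reference margin keeps the weight `beta`, and `gamma = 0` recovers the standard DPO loss. The function name and scalar log-probability inputs are illustrative; for a diffusion model these values would come from per-step denoising likelihood estimates.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def entropy_regularized_dpo_loss(
    logp_w: float,      # policy log-prob of the preferred sample
    logp_l: float,      # policy log-prob of the rejected sample
    ref_logp_w: float,  # reference-model log-prob of the preferred sample
    ref_logp_l: float,  # reference-model log-prob of the rejected sample
    beta: float = 0.1,  # KL regularization strength (standard DPO)
    gamma: float = 0.0, # hypothetical entropy-bonus weight (assumption)
) -> float:
    """Pairwise preference loss with an assumed entropy bonus.

    With gamma = 0 this reduces to the standard DPO loss
    -log sigmoid(beta * (policy_margin - ref_margin)).
    """
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    logit = (beta + gamma) * policy_margin - beta * ref_margin
    return -log_sigmoid(logit)


if __name__ == "__main__":
    # Same preference pair, with and without the assumed entropy weight.
    base = entropy_regularized_dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=0.5)
    with_entropy = entropy_regularized_dpo_loss(
        -1.0, -2.0, -1.5, -1.5, beta=0.5, gamma=0.5
    )
    print(base, with_entropy)
```

In this sketch a positive `gamma` changes only how strongly the policy's own margin counts relative to the reference model's. Whether this matches the authors' regularizer should be checked against the full paper.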
Findings
Enhanced image diversity and specificity.
Achieved state-of-the-art results on key metrics.
Improved training stability and robustness.
Abstract
Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) with human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution data during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive…
Taxonomy
Topics: Data Management and Algorithms
Methods: Diffusion · ALIGN · Direct Preference Optimization
