SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Shivanshu Shekhar; Shreyas Singh; Tong Zhang

arXiv:2411.04712 · cs.CV · October 7, 2025

TL;DR

This paper introduces a self-entropy regularization mechanism to improve the stability and robustness of DPO-based training for diffusion models, reducing overfitting and reward hacking, and enhancing image quality and diversity.

Contribution

The paper proposes a novel self-entropy regularization method that stabilizes DPO training for diffusion models, addressing overfitting and reward hacking issues.

Findings

01 Enhanced image diversity and specificity.

02 Achieved state-of-the-art results on key metrics.

03 Improved training stability and robustness.

Abstract

Direct Preference Optimization (DPO) has been used successfully to align large language models (LLMs) with human preferences, and more recently it has also been applied to improve the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when prolonged training pushes the generative model to fit out-of-distribution data. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive…
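
As a rough illustration of how an entropy bonus can enter a DPO-style objective, the sketch below assumes the standard KL-regularized reward objective with an added self-entropy term weighted by a hypothetical coefficient `gamma`. Under that assumption the implicit reward is r = (beta + gamma) log pi - beta log pi_ref + const, so the policy log-ratio is scaled by `beta + gamma` while the reference log-ratio keeps `beta`. This form is not given in the abstract and may differ from the paper's diffusion-specific loss, which would work with denoising errors rather than exact log-likelihoods; all function and parameter names here are illustrative.

```python
import math


def preference_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, gamma=0.0):
    """Implicit reward margin between a preferred (w) and rejected (l) sample.

    Assumption: the objective is E[r] - beta * KL(pi || pi_ref) + gamma * H(pi).
    Its optimum gives r = (beta + gamma) * log pi - beta * log pi_ref + const,
    and the constant cancels when two samples for the same prompt are compared.
    """
    policy_margin = logp_w - logp_l          # log pi(w) - log pi(l)
    ref_margin = ref_logp_w - ref_logp_l     # log pi_ref(w) - log pi_ref(l)
    return (beta + gamma) * policy_margin - beta * ref_margin


def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, gamma=0.0):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(margin))."""
    z = preference_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, gamma)
    # Numerically stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical log-likelihoods; the policy equals the reference here.
    args = dict(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.0, ref_logp_l=-2.0, beta=0.1)
    print("plain DPO loss:       ", preference_loss(**args))
    print("entropy-weighted loss:", preference_loss(**args, gamma=0.05))
```

Setting `gamma=0` recovers the usual DPO logit, which makes the extra term easy to ablate in this simplified setting.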

Taxonomy

Topics: Data Management and Algorithms

Methods: Diffusion · ALIGN · Direct Preference Optimization