Data-regularized Reinforcement Learning for Diffusion Models at Scale

Haotian Ye; Kaiwen Zheng; Jiashu Xu; Puheng Li; Huayu Chen; Jiaqi Han; Sheng Liu; Qinsheng Zhang; Hanzi Mao; Zekun Hao; Prithvijit Chattopadhyay; Dinghao Yang; Liang Feng; Maosheng Liao; Junjie Bai; Ming-Yu Liu; James Zou; Stefano Ermon

arXiv:2512.04332 · cs.LG · December 25, 2025

Open Access

TL;DR

This paper introduces DDRL, a new reinforcement learning framework for diffusion models that uses data regularization to improve alignment with human preferences, reduce reward hacking, and enhance scalability.

Contribution

The paper proposes DDRL, a novel data-regularized RL approach for diffusion models, providing theoretical robustness and empirical improvements in high-resolution video generation.

Findings

01. Significantly improves reward alignment in diffusion models.

02. Reduces reward hacking and quality degradation.

03. Achieves the highest human preference scores in the paper's experiments.

Abstract

Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are vulnerable to reward hacking, which shows up as quality degradation, over-stylization, or reduced diversity. Our analysis attributes this to an inherent limitation of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human…
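The idea of combining reward maximization with a data-fitting loss can be illustrated on a toy problem. The sketch below is not the paper's algorithm and involves no diffusion model: it assumes a 1-D Gaussian "policy" N(mu, 1), a hypothetical quadratic reward, and a hypothetical weight `beta`. In that setting the forward KL from the data distribution to the policy reduces, up to a constant, to a maximum-likelihood loss on off-policy samples, which stands in for the diffusion loss.

```python
# Toy illustration only: a 1-D Gaussian "policy" N(mu, 1), not a diffusion model.
# All names (data, target, beta, combined_grad, train) are hypothetical.

def combined_grad(mu, data, target, beta):
    """Gradient w.r.t. mu of  beta * KL(p_data || p_mu)  -  E_{x ~ p_mu}[r(x)].

    With unit variance, the forward KL equals 0.5 * mean((x - mu)^2) + const,
    i.e. a maximum-likelihood (data-fitting) loss on off-policy samples.
    The assumed reward r(x) = -(x - target)^2 has E[r] = -((mu - target)^2 + 1).
    """
    data_mean = sum(data) / len(data)
    grad_data_loss = mu - data_mean        # d/dmu of 0.5 * mean((x - mu)^2)
    grad_neg_reward = 2.0 * (mu - target)  # d/dmu of (mu - target)^2
    return beta * grad_data_loss + grad_neg_reward


def train(data, target, beta, lr=0.05, steps=2000, mu=0.0):
    """Plain gradient descent on the combined objective."""
    for _ in range(steps):
        mu -= lr * combined_grad(mu, data, target, beta)
    return mu


data = [-0.5, 0.0, 0.5, 1.0, -1.0]  # off-policy samples with mean 0.0
mu_weak = train(data, target=3.0, beta=0.1)     # weak data regularization
mu_strong = train(data, target=3.0, beta=10.0)  # strong data regularization
```

In this toy the optimum is available in closed form, mu* = (beta * data_mean + 2 * target) / (beta + 2), so a larger `beta` keeps the policy mean closer to the data mean while a smaller `beta` lets it drift toward the reward's target. The penalty is computed from data samples rather than from the policy's own samples, which is the sense in which the regularizer here is off-policy.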

Taxonomy

Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning