PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models
Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou

TL;DR
PRDP introduces a stable, supervised reward difference prediction method for large-scale reward finetuning of diffusion models, outperforming RL-based methods on complex, unseen prompts in vision tasks.
Contribution
The paper proposes PRDP, a novel reward difference prediction approach that stabilizes large-scale reward finetuning of diffusion models, enabling better generalization to complex prompts.
Findings
PRDP matches RL-based methods at small-scale reward maximization.
PRDP outperforms RL-based methods in large-scale training when evaluated on unseen prompts.
PRDP generates higher-quality images across diverse prompts.
Abstract
Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression…
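The abstract is cut off before it spells out the RDP objective, so the following is only a minimal sketch of what a reward-difference regression loss could look like, not the authors' implementation. It assumes the standard KL-regularized RL setting, in which the optimal model satisfies r(a) - r(b) = beta * (log-ratio(a) - log-ratio(b)), where each log-ratio compares the finetuned model to a frozen reference model along a denoising trajectory. The function name `rdp_loss`, its parameters, and `beta` are hypothetical, and the "proximal" part of PRDP is omitted entirely.

```python
from typing import Sequence


def rdp_loss(
    logp_a: Sequence[float],      # per-step log-probs of trajectory A under the finetuned model
    logp_ref_a: Sequence[float],  # per-step log-probs of trajectory A under the reference model
    logp_b: Sequence[float],      # same quantities for trajectory B
    logp_ref_b: Sequence[float],
    reward_a: float,              # black-box reward of the image produced by trajectory A
    reward_b: float,              # black-box reward of the image produced by trajectory B
    beta: float = 1.0,            # assumed KL-regularization strength (hypothetical name)
) -> float:
    """Squared error between predicted and observed reward difference for one image pair."""
    # Log-ratio of finetuned vs. reference model, summed over denoising steps.
    log_ratio_a = sum(logp_a) - sum(logp_ref_a)
    log_ratio_b = sum(logp_b) - sum(logp_ref_b)

    # Reward difference implied by the model under the assumed KL-regularized optimum.
    predicted_diff = beta * (log_ratio_a - log_ratio_b)

    # Supervised regression target: the observed reward difference.
    target_diff = reward_a - reward_b
    return (predicted_diff - target_diff) ** 2
```

Under these assumptions the loss is zero exactly when the model's implied reward difference matches the observed one, which is what makes it a plain regression problem rather than a policy-gradient update.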
Taxonomy
Topics: Neural Networks and Applications
Methods: Sparse Evolutionary Training · Diffusion
