Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward
Zhiwei Jia, Yuesong Nan, Huixi Zhao, Gengdai Liu

TL;DR
This paper introduces LaSRO, a novel method for fine-tuning diffusion models using learned differentiable surrogate rewards in latent space, enabling ultra-fast image generation with improved efficiency and stability.
Contribution
LaSRO is the first approach to learn surrogate rewards in latent space for fine-tuning diffusion models, outperforming existing RL methods in ultra-fast image generation.
Findings
LaSRO outperforms PPO, DPO, DDPO, and Diffusion-DPO in ultra-fast image generation tasks.
LaSRO effectively converts arbitrary rewards into differentiable forms for gradient-based optimization.
Theoretical analysis links LaSRO to value-based reinforcement learning.
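The second finding can be illustrated with a toy sketch. This is not the authors' implementation: LaSRO builds its surrogate on a pre-trained latent diffusion model and uses it to fine-tune a generator, whereas the sketch below assumes a hypothetical 2-D "latent", a made-up integer-valued reward (`black_box_reward`), and a simple quadratic regression as the surrogate, and it optimizes the latent directly. All names and constants are illustrative assumptions. It only shows the general idea that a reward with no useful gradient can be replaced by a fitted differentiable model whose gradient then guides optimization.

```python
import random

TARGET = [1.0, -0.5]  # hypothetical "ideal" latent, used only by the toy reward


def black_box_reward(z):
    """Toy non-differentiable reward: negated, rounded squared distance to TARGET."""
    dist_sq = sum((zi - ti) ** 2 for zi, ti in zip(z, TARGET))
    return -round(dist_sq)  # integer-valued, so its gradient is zero almost everywhere


def features(z):
    """Quadratic features [1, z1, z2, z1^2, z2^2] used by the surrogate."""
    return [1.0] + list(z) + [zi * zi for zi in z]


def surrogate_grad(w, z):
    """Analytic gradient of the fitted surrogate with respect to the latent z."""
    n = len(z)
    return [w[1 + i] + 2.0 * w[1 + n + i] * z[i] for i in range(n)]


def fit_surrogate(samples, rewards, lr=0.1, steps=500):
    """Fit surrogate weights by full-batch gradient descent on mean squared error."""
    feats = [features(z) for z in samples]
    w = [0.0] * len(feats[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for f, r in zip(feats, rewards):
            err = sum(wi * fi for wi, fi in zip(w, f)) - r
            for j, fj in enumerate(f):
                grad[j] += err * fj
        w = [wi - lr * gj / len(feats) for wi, gj in zip(w, grad)]
    return w


rng = random.Random(0)
# Exploration data: latents sampled independently of the point being optimized.
samples = [[rng.uniform(-2.0, 2.0) for _ in range(2)] for _ in range(200)]
rewards = [black_box_reward(z) for z in samples]
w = fit_surrogate(samples, rewards)

# Gradient ascent on the latent, guided only by the surrogate's gradient.
z_start = [-1.5, 1.5]
z = list(z_start)
for _ in range(50):
    g = surrogate_grad(w, z)
    z = [zi + 0.1 * gi for zi, gi in zip(z, g)]

print("reward before:", black_box_reward(z_start))
print("reward after: ", black_box_reward(z))
```

The true reward is queried only to label the training samples and to report the result; every update to `z` uses the surrogate's gradient. In this toy setting the quadratic surrogate is expressive enough to match the underlying reward shape, which would not hold for real image rewards.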
Abstract
Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to step-distilled DMs is challenging for ultra-fast (≤2-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on these insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for effective reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and tailors reward optimization for ≤2-step image generation with efficient off-policy exploration. LaSRO is…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization · Diffusion
