Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

Zhiwei Jia; Yuesong Nan; Huixi Zhao; Gengdai Liu

arXiv:2411.15247 · cs.LG · March 12, 2025

TL;DR

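This paper introduces LaSRO, a method for reward fine-tuning step-distilled diffusion models used for ultra-fast (at most two-step) image generation. LaSRO learns differentiable surrogate rewards in the latent space of SDXL, so that arbitrary rewards, including non-differentiable ones, can guide the model through reward gradients, with improved efficiency and stability.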

Contribution

LaSRO is the first approach to learn surrogate rewards in latent space for fine-tuning diffusion models, outperforming existing RL methods in ultra-fast image generation.

Findings

01 LaSRO outperforms PPO, DPO, DDPO, and Diffusion-DPO in ultra-fast image generation tasks.

02 LaSRO effectively converts arbitrary rewards into differentiable forms for gradient-based optimization.

03 Theoretical analysis links LaSRO to value-based reinforcement learning.

Abstract

Recent research has shown that fine-tuning diffusion models (DMs) with arbitrary rewards, including non-differentiable ones, is feasible with reinforcement learning (RL) techniques, enabling flexible model alignment. However, applying existing RL methods to step-distilled DMs is challenging for ultra-fast ($\leq 2$-step) image generation. Our analysis suggests several limitations of policy-based RL methods such as PPO or DPO toward this goal. Based on the insights, we propose fine-tuning DMs with learned differentiable surrogate rewards. Our method, named LaSRO, learns surrogate reward models in the latent space of SDXL to convert arbitrary rewards into differentiable ones for effective reward gradient guidance. LaSRO leverages pre-trained latent DMs for reward modeling and tailors reward optimization for $\leq 2$-step image generation with efficient off-policy exploration. LaSRO is…
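The core idea in the abstract, replacing a non-differentiable reward with a learned differentiable surrogate and then following the surrogate's gradient, can be illustrated with a toy sketch. This is not the authors' implementation: LaSRO builds its surrogate on a pre-trained latent DM in the latent space of SDXL, whereas the sketch below assumes a 4-dimensional toy latent, a quantized black-box reward, and a quadratic least-squares surrogate. All names (`black_box_reward`, `fit_surrogate`, `surrogate_grad`, `target`) are hypothetical.

```python
import numpy as np

def black_box_reward(z, target):
    # Hypothetical reward: quantized to steps of 0.25, so it is piecewise
    # constant and its gradient is zero almost everywhere.
    return -np.round(4.0 * np.sum((z - target) ** 2, axis=-1)) / 4.0

def features(z):
    # Per-coordinate linear and quadratic features plus a bias term.
    z = np.atleast_2d(z)
    return np.concatenate([z, z ** 2, np.ones((z.shape[0], 1))], axis=1)

def fit_surrogate(latents, rewards):
    # Least-squares fit of the surrogate weights to observed rewards.
    w, *_ = np.linalg.lstsq(features(latents), rewards, rcond=None)
    return w

def surrogate_grad(z, w):
    # Analytic gradient of the quadratic surrogate with respect to z.
    d = z.shape[-1]
    return w[:d] + 2.0 * w[d:2 * d] * z

rng = np.random.default_rng(0)
dim = 4
target = np.array([1.0, -0.5, 0.25, 2.0])  # assumed optimum of the toy reward

# Off-policy data: latents sampled independently of the point being optimized.
latents = rng.normal(scale=2.0, size=(2000, dim))
rewards = black_box_reward(latents, target)
w = fit_surrogate(latents, rewards)

# Gradient ascent on the surrogate; z stands in for the generator's output.
z = np.zeros(dim)
reward_before = black_box_reward(z, target)
for _ in range(200):
    z = z + 0.05 * surrogate_grad(z, w)
reward_after = black_box_reward(z, target)

print("reward before:", reward_before, "reward after:", reward_after)
```

In this toy setting the black-box reward is only ever queried for values, never for gradients; all gradient information comes from the fitted surrogate. In the paper's setting the gradient would flow further back into the parameters of the two-step generator rather than stopping at the latent.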

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization · Diffusion