Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning
Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao, Wenpin Tang

TL;DR
This paper introduces a continuous-time reinforcement learning approach to fine-tune diffusion models, reducing discretization errors and improving alignment with input prompts, demonstrated on large-scale Text2Image models.
Contribution
It develops a novel continuous-time RL framework for diffusion model fine-tuning, connecting score matching with policy optimization and regularization.
Findings
Improved fine-tuning of diffusion models with continuous-time RL.
Enhanced value network design leveraging diffusion model structure.
Validated on large-scale Text2Image models.
Abstract
Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to induced discretization errors and is often not applicable to models with higher-order/black-box solvers. The objective of this study is to develop a disciplined approach to fine-tune diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, and thereby make connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL, and illustrate its potential in enhancing the…
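To make the "score as action" idea in the abstract concrete, here is a minimal toy sketch, not the paper's implementation. It assumes a 1-D Gaussian "data" distribution, a variance-preserving SDE with constant noise rate, a hand-picked terminal reward, and a regularization weight `lam`; all of these names and values are illustrative assumptions. The action replaces the score in the reverse-time SDE, the reward is paid only at the terminal state, and the penalty is the standard path-space KL rate between the controlled and pretrained dynamics.

```python
import math
import random

# All constants below are illustrative assumptions, not values from the paper.
BETA = 1.0                      # constant noise rate of a VP-type SDE
T = 5.0                         # time horizon
DATA_MEAN, DATA_STD = 2.0, 0.5  # toy 1-D Gaussian "data" distribution


def pretrained_score(x, t):
    """Exact score of the forward marginal at time t (stands in for a trained model)."""
    decay = math.exp(-0.5 * BETA * t)
    mean = DATA_MEAN * decay
    var = (DATA_STD ** 2) * decay ** 2 + 1.0 - decay ** 2
    return -(x - mean) / var


def rollout(action, n_steps=200, rng=random):
    """Euler-Maruyama on the reverse SDE, using `action` in place of the score.

    Returns the terminal sample and the accumulated penalty
    sum of 0.5 * BETA * (action - pretrained_score)^2 * dt along the path.
    """
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)  # start from the Gaussian prior
    penalty = 0.0
    for k in range(n_steps):
        t = T - k * dt  # forward-time label of the current state
        a = action(x, t)
        penalty += 0.5 * BETA * (a - pretrained_score(x, t)) ** 2 * dt
        noise = math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
        x += (0.5 * BETA * x + BETA * a) * dt + noise
    return x, penalty


def reward(x):
    """Hypothetical terminal reward: prefers samples near 3.0."""
    return -(x - 3.0) ** 2


def objective(action, lam=0.1, n_paths=500, seed=0):
    """Monte Carlo estimate of E[reward(x_T) - lam * penalty] for a given action."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        x, pen = rollout(action, rng=rng)
        total += reward(x) - lam * pen
    return total / n_paths
```

Calling `objective(pretrained_score)` evaluates the unmodified model, for which the penalty is zero. Passing a perturbed action such as `lambda x, t: pretrained_score(x, t) + 0.5` shows the trade-off a fine-tuning method has to balance: terminal samples move toward the rewarded region while the penalty grows.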
Taxonomy
Topics: Neural Networks and Applications · Stock Market Forecasting Methods · Reinforcement Learning in Robotics
Methods: Diffusion
