Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine

TL;DR
This paper introduces denoising diffusion policy optimization (DDPO), a reinforcement learning approach that optimizes diffusion models directly for downstream objectives such as human-perceived image quality and rewards derived from human feedback, and reports that it is more effective than reward-weighted likelihood approaches.
Contribution
It proposes a novel policy gradient method for diffusion models, enabling direct optimization for complex, human-centric objectives without additional data collection.
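As a rough illustration of what a policy gradient method for a diffusion model optimizes, the block below writes denoising as a Markov decision process and states the score-function (REINFORCE) gradient. The notation is an assumption made for this sketch and does not appear in this summary: c is the conditioning context (for example, a prompt), x_T is the initial noise, x_0 is the final sample, p_θ is the learned denoising distribution, and r is the reward.

```latex
% Denoising as an MDP (notation assumed for this sketch)
\begin{aligned}
s_t &= (c,\, t,\, x_t), \qquad a_t = x_{t-1}, \qquad
\pi_\theta(a_t \mid s_t) = p_\theta(x_{t-1} \mid x_t, c), \\
R(s_t, a_t) &=
  \begin{cases}
    r(x_0, c) & \text{if } t = 1 \text{ (the step that produces } x_0\text{)} \\
    0 & \text{otherwise,}
  \end{cases} \\
J(\theta) &= \mathbb{E}_{c \sim p(c),\; x_0 \sim p_\theta(x_0 \mid c)}
  \big[\, r(x_0, c) \,\big], \\
\nabla_\theta J(\theta) &= \mathbb{E}\Big[\, r(x_0, c)
  \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \Big].
\end{aligned}
```

Each denoising step has a tractable Gaussian log-probability, so the sum can be evaluated exactly along a sampled trajectory even though the marginal likelihood of x_0 cannot.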
Findings
DDPO effectively adapts text-to-image models to new objectives.
It improves aesthetic quality and image compressibility.
DDPO enhances prompt-image alignment using existing feedback models.
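To make the compressibility objective concrete, here is a minimal sketch of a file-size reward. It assumes the reward is the negative compressed size of the image, and it uses zlib (DEFLATE) from the Python standard library as a stand-in for an image codec such as JPEG. The function names and the kB scaling are hypothetical and are not taken from the paper's code.

```python
import zlib


def compressibility_reward(pixels: bytes) -> float:
    """Return the negative compressed size in kB, so smaller files score higher.

    Assumption: zlib stands in for a real image encoder (e.g. JPEG).
    `pixels` is the raw image buffer, e.g. height * width * 3 bytes of RGB.
    """
    compressed = zlib.compress(pixels, 9)  # 9 = strongest compression level
    return -len(compressed) / 1000.0


def incompressibility_reward(pixels: bytes) -> float:
    """Opposite objective: larger compressed files score higher."""
    return -compressibility_reward(pixels)
```

A flat, single-color image compresses to very few bytes and so receives a reward near zero, while a noisy image receives a strongly negative one. A reward of this kind is hard to request through a text prompt, which is why it makes a useful test of direct optimization.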
Abstract
Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as…
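The idea of treating denoising as a multi-step decision problem can be sketched on a toy example. The code below is not the authors' implementation. It assumes a 1-D "denoiser" with a single learnable parameter `theta`, a fixed step noise `sigma`, a hypothetical reward that prefers final samples near a target value, and a batch-mean baseline for variance reduction. It applies the score-function (REINFORCE) policy gradient across all denoising steps.

```python
import random


def sample_chain(theta, num_steps, sigma, rng):
    """Sample one denoising trajectory and accumulate its score.

    Each step draws x_{t-1} ~ N(x_t + theta, sigma^2). The returned score is
    sum_t d/dtheta log p_theta(x_{t-1} | x_t), evaluated on the sampled path.
    """
    x = rng.gauss(0.0, 1.0)  # x_T: start from pure noise
    score = 0.0
    for _ in range(num_steps):
        mean = x + theta
        x_prev = rng.gauss(mean, sigma)
        score += (x_prev - mean) / sigma ** 2  # gradient of the Gaussian log-density
        x = x_prev
    return x, score


def reward(x0, target=3.0):
    """Hypothetical reward: final samples closer to `target` score higher."""
    return -(x0 - target) ** 2


def train(iterations=200, batch_size=256, num_steps=5, sigma=0.5, lr=0.01, seed=0):
    """Gradient ascent on expected reward with the score-function estimator."""
    rng = random.Random(seed)
    theta = 0.0
    history = []  # mean batch reward per iteration
    for _ in range(iterations):
        batch = [sample_chain(theta, num_steps, sigma, rng) for _ in range(batch_size)]
        rewards = [reward(x0) for x0, _ in batch]
        baseline = sum(rewards) / batch_size  # assumed variance-reduction choice
        grad = sum((r - baseline) * s for r, (_, s) in zip(rewards, batch)) / batch_size
        theta += lr * grad
        history.append(baseline)
    return theta, history
```

With these defaults the expected final sample is `num_steps * theta`, so `train()` should drive `theta` toward roughly 0.6 and raise the mean reward over the course of training. The reward is only observed at the end of the chain, yet every step's log-probability gradient receives credit, which is the multi-step structure that distinguishes this from a single-step reward-weighted update.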
Peer Reviews
Decision: ICLR 2024 poster
This paper is well-written, with clear logic and beautifully crafted figures, making it easy to follow the authors' line of reasoning. Additionally, the paper is well-structured, presenting an easy-to-follow approach to fine-tuning diffusion models using reinforcement learning for alignment. The narrative is straightforward, and the methods described are, in my opinion, sensible.
1) In terms of image generation quality, the paper lacks a quantitative and qualitative comparison with recent works. It fails to provide experimental support for its effectiveness. Specifically, in the absence of comparisons in image quality with all methods related to "Optimizing diffusion models using policy gradients," it is challenging to discern the improvements this paper offers over baseline approaches. This makes it difficult to evaluate the paper's contribution to the community. 2) Re
- This paper is clearly written and easy to follow.
- The framing of the diffusion generative process as an MDP is clearly stated, and the proposed method generalizes prior work (i.e., reward-weighted regression (RWR)) to the multi-step MDP case.
- The experiments clearly validate the efficiency of DDPO over prior reward-based tuning of diffusion models, and detailed algorithmic choices are provided.
- After RL fine-tuning, the generated images seem to be saturated. For example, the fine-tuned models generate images with high aesthetic scores, but they seem to generate images with similar backgrounds of sunset. For prompt alignment experiments, the models generate cartoon-like images.
- I think one of the main contributions of the paper is on utilizing VLMs for optimizing text-to-image diffusion models. In this context, the discussion on the choice of reward function should be discussed mor
- The paper proposes a novel work (other work mentioned in the related works section is concurrent work, as it will only be published at NeurIPS in December).
- The approach is simple yet effective.
- A variety of reward functions are explored and all yield visually pleasing results.
- The proposed method does not consider the problem of overoptimisation, instead the authors argue that early stopping from visual inspection is sufficient. This makes the method applicable to problems where visual inspection is possible (which is likely the case for many image tasks). However, it renders the method inapplicable to problems where the needed visual inspection is not possible. (E.g. one might not be able to apply this method to a medical imaging task where visual inspection by a h
Code & Models
- 🤗 kvablack/ddpo-aesthetic (model) · 7 dl · ♡ 1
- 🤗 kvablack/ddpo-alignment (model) · 8 dl · ♡ 7
- 🤗 kvablack/ddpo-compressibility (model) · 6 dl · ♡ 4
- 🤗 kvablack/ddpo-incompressibility (model) · 6 dl · ♡ 1
- 🤗 alkzar90/ddpo-aesthetic-celebahq-256 (model) · 18 dl · ♡ 1
- 🤗 alkzar90/ddpo-compressibility-celebahq-256 (model) · 4 dl
- 🤗 alkzar90/ddpo-incompressibility-celebahq-256 (model) · 3 dl
- 🤗 alkzar90/ddpo-aesthetic-church-256 (model) · 5 dl
- 🤗 alkzar90/ddpo-compressibility-church-256 (model) · 3 dl
- 🤗 alkzar90/ddpo-incompressibility-church-256 (model)
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Aesthetic Perception and Analysis · Recommender Systems and Techniques
Methods: Diffusion
