Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
Zhen Liu, Tim Z. Xiao, Weiyang Liu, Yoshua Bengio, Dinghuai Zhang

TL;DR
This paper introduces Nabla-GFlowNet, a reinforcement learning approach that efficiently finetunes large diffusion models like Stable Diffusion, preserving diversity and priors while optimizing for specific reward functions.
Contribution
It presents a novel gradient-informed GFlowNet method for fast, diversity-preserving diffusion model finetuning based on reward gradients.
Findings
Achieves fast finetuning of Stable Diffusion with preserved diversity.
Maintains prior knowledge during reward-based finetuning.
Effective on various realistic reward functions.
Abstract
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. In response to this challenge, we take inspiration from recent successes in generative flow networks (GFlowNets) and propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), that leverages the rich signal in reward gradients for probabilistic diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving…
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed idea builds on generative flow networks, which makes it intuitive and straightforward. - Nabla-GFlowNet can leverage first-order information about the reward function (its gradient), while the baselines use only zeroth-order information (reward values). - The experimental results show that the proposed method generally achieves the best diversity vs. reward trade-off frontiers.
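To make the first-order vs. zeroth-order distinction in this review concrete, here is a minimal toy sketch. It is not the paper's method and not any baseline's algorithm: the quadratic `reward`, the `target` vector, and the antithetic random-perturbation estimator are all assumptions chosen for illustration. It compares the exact reward gradient with an estimate built only from reward values.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical optimum of the toy reward

def reward(x):
    # Toy differentiable reward (an assumption): negative squared distance to target.
    return -np.sum((x - target) ** 2, axis=-1)

def reward_grad(x):
    # First-order information: the exact gradient of the toy reward.
    return -2.0 * (x - target)

def zeroth_order_grad(x, num_samples=2000, sigma=0.1):
    # Zeroth-order information: only reward values are queried.
    # Antithetic Gaussian perturbations give a Monte Carlo gradient estimate.
    eps = rng.standard_normal((num_samples, x.shape[0]))
    diff = reward(x + sigma * eps) - reward(x - sigma * eps)
    return np.mean((diff / (2.0 * sigma))[:, None] * eps, axis=0)

x = np.zeros(4)
g_exact = reward_grad(x)      # one gradient evaluation
g_est = zeroth_order_grad(x)  # 4000 reward evaluations
cosine = float(g_exact @ g_est / (np.linalg.norm(g_exact) * np.linalg.norm(g_est)))
print("exact gradient:   ", g_exact)
print("estimated gradient:", g_est)
print("cosine similarity: ", cosine)
```

In this toy setting the value-only estimate needs thousands of reward queries to approximate what a single gradient evaluation provides, which is one way to read the reviewer's point about the signal carried by reward gradients.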
- I think the "predicted reward" estimation in Eq. 15 can be severely unreliable, especially at the high-noise time steps of the diffusion model. The predicted clean image will be noisy, and if the reward function is computed by a model trained only on noise-free images, the predicted reward will be inaccurate. - The parameter \lambda and the output regularization described on page 7 seem to be crucial to the model's performance, but they are not the paper's contribution. - The qua
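The reviewer's concern about predicted clean images can be illustrated with a small sketch. Eq. 15 is not reproduced on this page, so the code below assumes the standard DDPM one-step estimate of the clean sample, x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t), with a linear beta schedule. The function names, the schedule values, and the simulated noise-prediction error `eps_error` are hypothetical choices, not taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (an assumption); returns the cumulative product of (1 - beta).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, eps, alpha_bar_t):
    # Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def predict_x0(x_t, eps_pred, alpha_bar_t):
    # One-step clean-sample estimate from a noise prediction.
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)                 # toy "clean image"
eps = rng.standard_normal(16)                # true noise
eps_error = 0.05 * rng.standard_normal(16)   # simulated error of a noise predictor

def x0_error(t):
    # Distance between the predicted and the true clean sample at step t.
    x_t = add_noise(x0, eps, alpha_bar[t])
    x0_hat = predict_x0(x_t, eps + eps_error, alpha_bar[t])
    return float(np.linalg.norm(x0_hat - x0))

err_low, err_high = x0_error(100), x0_error(900)
print("x0 error at t=100:", err_low)
print("x0 error at t=900:", err_high)
```

Under these assumptions the same noise-prediction error is scaled by sqrt((1 - alpha_bar_t) / alpha_bar_t), which grows with t, so a reward model evaluated on the predicted clean sample sees a much less accurate input at high-noise steps.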
1. This paper presents a new method for addressing the challenges of fine-tuning multistep sampling in diffusion models using GFlowNets. This method effectively eliminates the need to train a reward model that processes noisy input. 2. The paper develops its idea both theoretically and practically. Section 3.1 covers the theoretical aspect, while Sections 3.2 and 3.3 address the practical application.
The main weakness lies in the experiments. 1. The function $g_\phi(x_t)$ is an interesting and reasonable choice for achieving the fitness task; however, it results in approximately zero vectors, with a terminal constraint of $g_\phi(x_T) = 0$. It remains unclear whether a U-Net is a suitable option for this purpose. 2. The regularization term appears significant, with $\lambda=1000$ in the Aesthetic Score experiments and $\lambda=100$ in the HPSv2 experiments. However, Section 3.2 states that it
- The paper offers a comprehensive theoretical derivation of the proposed method, thoroughly explaining how the ∇-DB and residual ∇-DB objectives are derived. - By introducing residual ∇-DB, the authors extend the applicability of their work to pretrained large-scale models, which is crucial. - The paper enhances the quantitative evaluation of diversity in generated samples by employing a broader range of metrics and more extensive comparisons.
- The current experimental setting appears somewhat outdated. To enhance the study's relevance, please consider using more recent schedulers and pre-trained models instead of DDPM or Stable Diffusion 1.5. - The qualitative results shown in Figure 2 are confusing. Additional explanation is needed to clearly demonstrate the superiority of ∇-DB, as DDPO and DAG-DB also exhibit strong performance. - A user study would be helpful for evaluating diversity.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research
Methods: ALIGN · Diffusion
