Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets

Zhen Liu; Tim Z. Xiao; Weiyang Liu; Yoshua Bengio; Dinghuai Zhang

arXiv:2412.07775 · cs.LG · May 20, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces Nabla-GFlowNet, a reinforcement learning approach that efficiently finetunes large diffusion models like Stable Diffusion, preserving diversity and priors while optimizing for specific reward functions.

Contribution

It presents a novel gradient-informed GFlowNet method for fast, diversity-preserving diffusion model finetuning based on reward gradients.

Findings

01. Achieves fast finetuning of Stable Diffusion with preserved diversity.

02. Maintains prior knowledge during reward-based finetuning.

03. Effective on various realistic reward functions.

Abstract

While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. In response to this challenge, we take inspiration from recent successes in generative flow networks (GFlowNets) and propose a reinforcement learning method for diffusion model finetuning, dubbed Nabla-GFlowNet (abbreviated as $\nabla$-GFlowNet), that leverages the rich signal in reward gradients for probabilistic diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving…

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The proposed idea is based on generative flow networks, which makes it intuitive and straightforward.
- Nabla-GFlowNet can leverage the first-order information of the reward function (its gradient), while the baselines use only zero-order information.
- The experimental results show that the proposed method generally achieves the best diversity vs. reward trade-off frontiers.
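
The difference between zero-order and first-order reward information mentioned in the second strength can be illustrated with a toy example. This is only a sketch: the quadratic `reward`, the `target` vector, and the step size are hypothetical and are not taken from the paper, whose rewards are learned models such as an aesthetic scorer.

```python
import numpy as np

# Hypothetical toy reward (an assumption, not the paper's reward model):
# it peaks when the sample x equals `target`.
target = np.array([1.0, -2.0, 0.5])

def reward(x):
    """Zero-order signal: one scalar per sample."""
    return -np.sum((x - target) ** 2)

def reward_grad(x):
    """First-order signal: one value per input dimension."""
    return -2.0 * (x - target)

x = np.zeros(3)
r = reward(x)        # scalar: says how good x is, not how to improve it
g = reward_grad(x)   # vector with the same shape as x: says which way to move

# A small step along the gradient increases the reward.
step = 0.1
x_new = x + step * g
improved = reward(x_new) > r
```

A method that sees only `r` must infer the improving direction from many sampled values, whereas `g` supplies that direction for every input dimension in a single evaluation.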

Weaknesses

- I think the "predicted reward" estimation in Eq. 15 can be severely unreliable, especially at the high-noise time-steps of the diffusion model. The predicted clean image will be noisy, and if the reward function is computed by a model that was trained only on clean images, the predicted reward will be inaccurate.
- The parameter \lambda and the output regularization described on Page 7 seem to be crucial to the model's performance, but they are not the paper's contribution.
- The qua
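
One way to see the concern in the first weakness is the sketch below. It assumes that the predicted clean image is the standard one-step DDPM estimate computed from a predicted noise; the exact form of Eq. 15 is not reproduced on this page, so this is an assumption. The names `predict_x0`, `alpha_bar`, and `delta`, and the fixed-size error `delta`, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_x0(x_t, eps_hat, alpha_bar):
    """Standard DDPM one-step estimate of the clean sample from x_t."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

x0 = rng.standard_normal(16)            # stand-in for a clean image
eps = rng.standard_normal(16)           # true forward-process noise
delta = 0.1 * rng.standard_normal(16)   # hypothetical noise-prediction error

errors = {}
for alpha_bar in (0.99, 0.5, 0.01):     # low, medium, and high noise levels
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    x0_hat = predict_x0(x_t, eps + delta, alpha_bar)
    errors[alpha_bar] = np.linalg.norm(x0_hat - x0)

# The same error in eps_hat is scaled by sqrt((1 - alpha_bar) / alpha_bar),
# so the clean-image estimate degrades as alpha_bar approaches 0.
```

Under these assumptions, the error in the estimated clean image grows by roughly two orders of magnitude between `alpha_bar = 0.99` and `alpha_bar = 0.01`, so a reward model trained on clean images would be evaluated on increasingly off-distribution inputs.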

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper presents a new method for addressing the challenges of fine-tuning multistep sampling in diffusion models using GFlowNets. This method effectively eliminates the need to train a reward model that processes noisy input.
2. This paper implements its idea in both theoretical and practical contexts. Section 3.1 covers the theoretical aspect, while Sections 3.2 and 3.3 address the practical application.

Weaknesses

The main weakness is in the experimental part.

1. The function $g_\phi(x_t)$ is an interesting and reasonable choice for achieving the fitness task; however, it results in approximately zero vectors, with a terminal constraint of $g_\phi(x_T) = 0$. It remains unclear whether a U-Net is a suitable option for this purpose.
2. The regularization term appears significant, with $\lambda=1000$ in the Aesthetic Score experiments and $\lambda=100$ in the HPSv2 experiments. However, Section 3.2 states that it

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The paper offers a comprehensive theoretical derivation of the proposed method, thoroughly explaining how the ∇-DB and residual ∇-DB objectives are derived.
- By introducing residual ∇-DB, the authors extend the applicability of their work to pretrained large-scale models, which is crucial.
- The paper enhances the quantitative evaluation of diversity in generated samples by employing a broader range of metrics and more extensive comparisons.

Weaknesses

- The current experimental setting appears somewhat outdated. To enhance the study's relevance, please consider using more recent schedulers and pre-trained models instead of DDPM or Stable Diffusion 1.5.
- The qualitative results shown in Figure 2 are confusing. Additional explanation is needed to clearly demonstrate the superiority of ∇-DB, as DDPO and DAG-DB also exhibit strong performance.
- A user study would be helpful for evaluating diversity.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research

Methods: ALIGN · Diffusion