Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Hyeongyu Kang; Jaewoo Lee; Woocheol Shin; Kiyoung Om; Jinkyoo Park

arXiv:2512.04559 · cs.LG · March 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SQDF, a novel reinforcement learning-based fine-tuning method for diffusion models that improves alignment with downstream objectives while maintaining diversity and naturalness.

Contribution

We propose SQDF, a KL-regularized policy gradient method for diffusion model fine-tuning, with three additions: a discount factor, consistency models, and an off-policy replay buffer.
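
For orientation, the textbook form of a KL-regularized fine-tuning objective with a terminal reward, and the soft value and Q-functions that go with it, can be written as below. This is a sketch of the standard formulation, not the paper's own equations. The symbols (p_theta for the fine-tuned denoiser, p_ref for the pretrained one, r for the reward, alpha for the KL weight) are assumed notation, and the paper's discount factor is deliberately left out.

```latex
% Assumed notation (not taken from the paper): p_theta is the fine-tuned denoiser,
% p_ref the pretrained one, r the terminal reward, alpha > 0 the KL weight.
% Undiscounted case only; the paper's discount factor is not modeled here.
\begin{align*}
\max_{\theta}\;& \mathbb{E}_{x_{0:T}\sim p_\theta}\Big[\, r(x_0) \;-\; \alpha \sum_{t=1}^{T} \mathrm{KL}\big(p_\theta(\cdot \mid x_t)\,\|\,p_{\mathrm{ref}}(\cdot \mid x_t)\big) \Big] \\
V^{*}(x_t) &= \alpha \log \mathbb{E}_{p_{\mathrm{ref}}}\!\big[\exp\!\big(r(x_0)/\alpha\big) \,\big|\, x_t\big], \qquad V^{*}(x_0) = r(x_0) \\
Q^{*}(x_t, x_{t-1}) &= V^{*}(x_{t-1}) \\
p^{*}(x_{t-1}\mid x_t) &\propto p_{\mathrm{ref}}(x_{t-1}\mid x_t)\,\exp\!\big(Q^{*}(x_t, x_{t-1})/\alpha\big)
\end{align*}
```

Larger alpha keeps the fine-tuned model closer to the pretrained one. As alpha goes to zero the objective reduces to pure reward maximization, the regime in which over-optimization is typically a concern.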

Findings

1. SQDF achieves higher target rewards in text-to-image tasks.
2. SQDF maintains diversity and naturalness better than existing methods.
3. SQDF demonstrates high sample efficiency in black-box optimization.

Abstract

Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves…
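
To make the phrase "reparameterized policy gradient" concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation. It assumes a scalar, single denoising step with a linear Gaussian policy, a quadratic reward, and an identity map standing in for the differentiable clean-sample predictor (SQDF uses a consistency model for that role). The discount factor and replay buffer are omitted, and all function names and hyperparameters are hypothetical.

```python
import random


def q_estimate(x_prev, target):
    """Hypothetical differentiable stand-in for a soft Q estimate.

    Assumption: the clean-sample predictor is the identity, so the estimate is
    just a quadratic reward evaluated at the denoised sample.
    """
    return -(x_prev - target) ** 2


def loss_and_grad(theta, theta_ref, xs, noises, sigma, alpha, target):
    """Batch-averaged surrogate loss (-Q + alpha * KL) and its gradient in theta."""
    loss, grad = 0.0, 0.0
    for x_t, eps in zip(xs, noises):
        # Reparameterization: with eps fixed, the sample is a deterministic,
        # differentiable function of theta.
        x_prev = theta * x_t + sigma * eps
        # KL between two Gaussians with equal variance sigma^2 and means
        # theta * x_t (fine-tuned) and theta_ref * x_t (reference).
        kl = ((theta - theta_ref) * x_t) ** 2 / (2.0 * sigma ** 2)
        loss += -q_estimate(x_prev, target) + alpha * kl
        # Chain rule: dQ/dtheta = dQ/dx_prev * dx_prev/dtheta.
        dq_dtheta = -2.0 * (x_prev - target) * x_t
        dkl_dtheta = (theta - theta_ref) * x_t ** 2 / sigma ** 2
        grad += -dq_dtheta + alpha * dkl_dtheta
    n = len(xs)
    return loss / n, grad / n


def fine_tune(alpha, theta_ref=0.5, target=2.0, sigma=0.1,
              lr=0.05, steps=200, batch=64, seed=0):
    """Plain SGD on the surrogate, starting from the reference parameter."""
    rng = random.Random(seed)
    theta = theta_ref
    for _ in range(steps):
        xs = [rng.gauss(1.0, 0.1) for _ in range(batch)]      # noisy inputs x_t
        noises = [rng.gauss(0.0, 1.0) for _ in range(batch)]  # reparameterization noise
        _, grad = loss_and_grad(theta, theta_ref, xs, noises, sigma, alpha, target)
        theta -= lr * grad
    return theta
```

Under these toy settings, `fine_tune(alpha=0.0)` should drive the parameter close to the reward-maximizing value near 2, while `fine_tune(alpha=0.1)` should settle much closer to the reference value of 0.5, at roughly 0.75. That gap is the reward versus fidelity trade-off that the KL term controls.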

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Substantial baselines: they compared against several past works. They also tested whether a simple addition to the baselines (adding KL regularization) would improve them, and found that their method was still superior.
- Ablations: they removed each added component in turn to confirm that it contributes to the final result.
- Clarity: the paper is clear and easy to read, and the diagrams make sense.
- Significant improvement over prior work, visible in both the qualitative results and the graphs.

Weaknesses

See the “questions” section for more details. It is not clear to me that the minor improvements in diversity produced by the buffer (0.01 to 0.02 in Table 2) are worth the extra complexity that adding it to the algorithm requires, or the hit to aesthetic score. Similarly, diversity seems to improve without the consistency model (though this is a minor effect, and I think the improvements to aesthetic score suggest the CM is still a good contribution to the algorithm).

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Reward over-optimization is a well-known problem in diffusion fine-tuning, and the paper proposes a well-structured method with results that support its claims.
2. Each component of the method is well motivated, and evidence is provided through ablation studies.
3. Both quantitative and qualitative results show better reward-alignment and reward-diversity trade-offs than prior work.

Weaknesses

1. Lack of a fine-tuning baseline: [1] has already proposed RL fine-tuning that focuses on mitigating over-optimization.
2. Lack of training-free baselines: [2] has shown that training-free methods can mitigate over-optimization compared to fine-tuning, and [3] and [4] show that a soft value function can also be used in training-free methods. Without a comparison with these methods or a justifying statement, it is unclear why fine-tuning is necessary.

[1] Zhang, Ziyi, et al. "Confronting reward overop

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The training-free soft Q-function approximation eliminates the need for unstable value network training.
2. The paper demonstrates effectiveness across multiple tasks (aesthetic scoring, human preference optimization, black-box settings) with thorough comparisons against relevant baselines and ablation studies validating each component.
3. Results show SQDF achieves better Pareto frontiers, optimizing target rewards while maintaining significantly better alignment and diversity metrics comp

Weaknesses

1. Using the consistency model to estimate the future value is intriguing, but it raises the question of why the consistency model is not fine-tuned directly, which could potentially achieve similar results more efficiently given its single-step generation capability. Also see Q1.
2. While methodologically sound, the paper provides little discussion of computational overhead, training time comparisons, or memory requirements.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neuroimaging Techniques and Applications · Domain Adaptation and Few-Shot Learning