Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Tao Ren, Zishi Zhang, Jingyang Jiang, Zehao Li, Shentao Qin, Yi Zheng, Guanghao Li, Qianyou Sun, Yan Li, Jiafeng Liang, Xinping Li, Yijie Peng

TL;DR
This paper introduces the Recursive Likelihood Ratio optimizer, a novel half-order fine-tuning method for diffusion models that improves training efficiency and stability by providing unbiased, low-variance gradient estimates, validated through extensive experiments.
Contribution
The paper proposes a new half-order fine-tuning paradigm with an unbiased gradient estimator for diffusion models, addressing limitations of existing methods like RL and truncated BP.
Findings
RLR outperforms RL and truncated BP in efficiency and stability.
Theoretical analysis confirms lower variance and unbiasedness of RLR.
Experimental results show superior image and video generation quality.
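As a toy illustration of the variance and unbiasedness claims above, the sketch below compares a first-order (pathwise) gradient estimator with a zeroth-order (likelihood-ratio) estimator on a one-dimensional Gaussian. It is not the paper's RLR estimator. The objective `f(x) = x**2`, the unit-variance Gaussian, and the function name `gradient_samples` are assumptions chosen for illustration. Both estimators target the same gradient, `2 * theta`, but the likelihood-ratio samples are much noisier.

```python
import random
import statistics

def gradient_samples(theta, n, seed=0):
    """Per-sample gradient estimates of d/dtheta E[x^2], x ~ N(theta, 1)."""
    rng = random.Random(seed)
    first_order, zeroth_order = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x = theta + eps
        # Pathwise (first-order): differentiate f(theta + eps) directly.
        first_order.append(2.0 * x)
        # Likelihood ratio (zeroth-order): f(x) * d/dtheta log N(x; theta, 1),
        # where the score is (x - theta) = eps. No derivative of f is needed.
        zeroth_order.append(x * x * eps)
    return first_order, zeroth_order

fo, zo = gradient_samples(theta=1.0, n=100_000)
print("first-order  mean/var:", statistics.fmean(fo), statistics.pvariance(fo))
print("zeroth-order mean/var:", statistics.fmean(zo), statistics.pvariance(zo))
```

Both sample means should land near the true gradient of 2, while the zeroth-order variance is several times larger. That gap is the kind of trade-off a half-order method aims to balance against the cost of full backpropagation.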
Abstract
The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DM. The HO gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the…
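The bias of truncated BP mentioned in the abstract can be seen on a toy recursive chain. The sketch below is a hypothetical example, not the paper's setup: it assumes a scalar chain `x_{t+1} = a * x_t + theta` with a parameter `theta` shared across all steps, and the helper `grad_x_T` is introduced only for illustration. Backpropagating through only the last few steps drops the contributions of the earlier steps, so the result systematically underestimates the true gradient.

```python
def grad_x_T(a, T, last_k=None):
    """d x_T / d theta for the toy chain x_{t+1} = a * x_t + theta.

    theta is shared across all T steps. With last_k set, gradients flow
    only through the final last_k steps (the earlier state is detached).
    """
    steps = T if last_k is None else min(last_k, T)
    # The step applied k transitions before the end contributes a**k.
    return sum(a ** k for k in range(steps))

a, T = 0.9, 50
full_grad = grad_x_T(a, T)
truncated_grad = grad_x_T(a, T, last_k=1)
print(f"full BP: {full_grad:.3f}, truncated to 1 step: {truncated_grad:.3f}")
```

In this toy chain the truncated gradient keeps only the final step's contribution, so it is far smaller than the full-chain gradient. The gap does not shrink with more samples, because it is a bias rather than noise.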
Peer Reviews
Decision: ICLR 2026 Oral
- The paper presents a novel fine-tuning scheme and devises gradient estimators for the diffusion model’s chain-of-thought, which appears genuinely innovative.
- The theoretical analysis is careful and offers a credible justification for the proposed approach.
- Missing a comparison with related diffusion model fine-tuning baselines, such as D3PO [1].
- The experiments are limited to SD 1.4 and SD 2.0, which are now dated. Moreover, the method’s generalization to the Flux architecture remains unclear.
[1] Yang, Kai, et al. "Using human feedback to fine-tune diffusion models without any reward model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
- This paper addresses the crucial task of efficiently aligning foundation diffusion models. This represents a highly important and practically valuable problem.
- This paper proposes the Recursive Likelihood Ratio (RLR) optimizer, a novel "Half-Order" (HO) fine-tuning paradigm that successfully overcomes these challenges.
- The paper also introduces a novel prompt technique that synergizes naturally with the RLR optimizer, further enhancing the originality of the contribution.
- The paper ri
- The paper exhibits significant inconsistencies in its core methodology description, particularly regarding the sampling strategy for the Half-Order (HO) sub-chain starting point, $j$. In Section 4.2 (Methodology), the paper describes $j$ as being sampled from a categorical distribution based on gradient norms. However, in Section 5.3 (DCoT Experiment), $j$ is described as being selected from a uniform distribution ($j \sim \mathcal{U}(1, T-h)$). This contradictory description makes it impossib
1. The innovative concept of a "half-order" fine-tuning paradigm is proposed, which fills the gap between traditional first-order and zero-order methods.
2. FO, HO and ZO complement each other's strengths and find the optimal balance between variance and computational cost by optimizing the $h$ and $j$ parameters, taking into account the actual computational budget constraints.
1. The problem with FO is its high cost, and the problem with ZO is its high variance, but the authors do not provide a clear analysis to explain this. For example, for a specific scenario, how many NFEs are needed for FO, ZO, and HO respectively, how is this calculated, what are the variances of the three, and why is the variance problem so large? I think this needs a clearer analysis.
2. The visualization results look average, and the improvement is not significant enough. Lacks compa
Taxonomy
Topics: Model Reduction and Neural Networks · Advanced Mathematical Modeling in Engineering
Methods: ALIGN · Diffusion
