Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer

Tao Ren; Zishi Zhang; Jingyang Jiang; Zehao Li; Shentao Qin; Yi Zheng; Guanghao Li; Qianyou Sun; Yan Li; Jiafeng Liang; Xinping Li; Yijie Peng

arXiv:2502.00639·cs.CV·September 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces the Recursive Likelihood Ratio optimizer, a novel half-order fine-tuning method for diffusion models that improves training efficiency and stability by providing unbiased, low-variance gradient estimates, validated through extensive experiments.

Contribution

The paper proposes a new half-order fine-tuning paradigm with an unbiased gradient estimator for diffusion models, addressing limitations of existing methods like RL and truncated BP.

Findings

1. RLR outperforms RL and truncated BP in efficiency and stability.
2. Theoretical analysis confirms lower variance and unbiasedness of RLR.
3. Experimental results show superior image and video generation quality.

Abstract

The probabilistic diffusion model (DM), generating content by inferencing through a recursive chain structure, has emerged as a powerful framework for visual generation. After pre-training on enormous data, the model needs to be properly aligned to meet requirements for downstream applications. How to efficiently align the foundation DM is a crucial task. Contemporary methods are either based on Reinforcement Learning (RL) or truncated Backpropagation (BP). However, RL and truncated BP suffer from low sample efficiency and biased gradient estimation, respectively, resulting in limited improvement or, even worse, complete training failure. To overcome the challenges, we propose the Recursive Likelihood Ratio (RLR) optimizer, a Half-Order (HO) fine-tuning paradigm for DM. The HO gradient estimator enables the computation graph rearrangement within the recursive diffusive chain, making the…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper presents a novel fine-tuning scheme and devises gradient estimators for the diffusion model’s chain-of-thought, which appears genuinely innovative.
- The theoretical analysis is careful and offers a credible justification for the proposed approach.

Weaknesses

- Missing a comparison with related diffusion model fine-tuning baselines, such as D3PO [1].
- The experiments are limited to SD 1.4 and SD 2.0, which are now dated. Moreover, the method’s generalization to the Flux architecture remains unclear.

[1] Yang, Kai, et al. "Using human feedback to fine-tune diffusion models without any reward model." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- This paper addresses the crucial task of efficiently aligning foundation diffusion models. This represents a highly important and practically valuable problem.
- This paper proposes the Recursive Likelihood Ratio (RLR) optimizer, a novel "Half-Order" (HO) fine-tuning paradigm that successfully overcomes these challenges.
- The paper also introduces a novel prompt technique that synergizes naturally with the RLR optimizer, further enhancing the originality of the contribution.
- The paper ri

Weaknesses

- The paper exhibits significant inconsistencies in its core methodology description, particularly regarding the sampling strategy for the Half-Order (HO) sub-chain starting point, $j$. In Section 4.2 (Methodology), the paper describes $j$ as being sampled from a categorical distribution based on gradient norms. However, in Section 5.3 (DCoT Experiment), $j$ is described as being selected from a uniform distribution ($j \sim \mathcal{U}(1, T-h)$). This contradictory description makes it impossib
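To make the two sampling descriptions contrasted in this review concrete, here is a minimal sketch of each. It is an illustration only, not the authors' implementation: the function names, the treatment of $\mathcal{U}(1, T-h)$ as a discrete uniform over the integers $1, \dots, T-h$, and the `grad_norms` input (one hypothetical nonnegative gradient-norm value per candidate start index) are all assumptions.

```python
import numpy as np


def sample_j_uniform(T, h, rng):
    """Draw j uniformly from the integers {1, ..., T - h} (assumed reading of U(1, T-h))."""
    # Generator.integers excludes the upper bound, hence the +1.
    return int(rng.integers(1, T - h + 1))


def sample_j_categorical(grad_norms, rng):
    """Draw j from {1, ..., len(grad_norms)} with probability proportional to grad_norms.

    grad_norms is a hypothetical array holding one nonnegative value per candidate j.
    """
    weights = np.asarray(grad_norms, dtype=float)
    probs = weights / weights.sum()
    # choice returns a 0-based index; shift to the 1-based indexing used for j.
    return int(rng.choice(len(probs), p=probs)) + 1
```

The two functions produce different distributions over $j$ unless all gradient norms are equal, which is why the choice between them matters for reproducing the method.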

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The innovative concept of "half-order" fine-tuning paradigm is proposed, which fills the gap between traditional first-order and zero-order methods.
2. FO, HO and ZO complement each other's strengths and find the optimal balance between variance and computational cost by optimizing the h and j parameters, taking into account the actual computational budget constraints.

Weaknesses

1. The problem with FO is its high cost, and the problem with ZO is its high variance, but the authors do not provide a clear analysis to explain this. For example, in a specific scenario, how many NFEs are needed for FO, ZO, and HO respectively, how is this calculated, what are the variances of the three, and why does ZO have such a large variance? I think this needs a clearer analysis.
2. The visualization results look average, and the improvement is not significant enough. Lacks compa
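The variance gap raised in the first point can be illustrated with a standard textbook toy problem. This sketch is not the paper's analysis and says nothing about NFE counts. It assumes that FO corresponds to a pathwise (backpropagated) gradient and ZO to a likelihood-ratio (score-function) estimator, and it uses a hypothetical objective $f(x) = x^2$ with $x \sim \mathcal{N}(\theta, \sigma^2)$, for which the true gradient of $\mathbb{E}[f(x)]$ with respect to $\theta$ is $2\theta$.

```python
import numpy as np


def pathwise_grad(theta, sigma, eps):
    """Pathwise estimator: differentiate f(theta + sigma * eps) = (theta + sigma * eps)**2."""
    return 2.0 * (theta + sigma * eps)


def likelihood_ratio_grad(theta, sigma, eps):
    """Likelihood-ratio estimator: f(x) times the score d/dtheta log N(x; theta, sigma^2)."""
    x = theta + sigma * eps
    return x**2 * (x - theta) / sigma**2


def compare(theta=1.0, sigma=1.0, n=200_000, seed=0):
    """Return sample mean and variance of both estimators on shared noise draws."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    fo = pathwise_grad(theta, sigma, eps)
    zo = likelihood_ratio_grad(theta, sigma, eps)
    return {
        "fo_mean": fo.mean(), "fo_var": fo.var(),
        "zo_mean": zo.mean(), "zo_var": zo.var(),
    }
```

Both estimators are unbiased for $2\theta$, but for $\theta = \sigma = 1$ the per-sample variances work out analytically to 4 for the pathwise estimator and 30 for the likelihood-ratio estimator, so the likelihood-ratio version needs many more samples to reach the same accuracy.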

Taxonomy

Topics: Model Reduction and Neural Networks · Advanced Mathematical Modeling in Engineering

Methods: ALIGN · Diffusion