TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding; Weirui Ye

arXiv:2512.08153 · cs.LG · December 10, 2025


Open Access · 3 Reviews

TL;DR

TreeGRPO introduces a tree-structured reinforcement learning framework that significantly enhances training efficiency and performance in aligning diffusion models with human preferences, reducing computational costs and improving reward optimization.

Contribution

The paper presents TreeGRPO, a novel tree-based RL method that improves sample efficiency, enables fine-grained credit assignment, and allows amortized computation for better training of generative models.

Findings

01 · Achieves 2.4× faster training compared to baselines

02 · Outperforms GRPO in multiple benchmarks and reward models

03 · Provides a scalable approach for RL-based generative model alignment

Abstract

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance under the same training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation where…
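The abstract's two core ideas, prefix reuse and step-specific advantages backed up from leaf rewards, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class, the helper names, the toy rewards, and the uniform averaging of child advantages are all assumptions made for illustration (Reviewer 02 describes the paper's backup as log-probability weighted, which is not reproduced here). Each edge stands for one shared denoising segment, leaf rewards are normalized across the group in the usual GRPO style, and each internal edge receives the average advantage of its children.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node; the edge into it is one shared denoising segment."""
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None  # set on leaves (final samples) only
    advantage: float = 0.0          # advantage credited to the incoming edge


def leaves(node: Node) -> List[Node]:
    """Collect all leaves below (or equal to) this node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def count_edges(node: Node) -> int:
    """Number of denoising segments actually computed in the tree."""
    return sum(1 + count_edges(child) for child in node.children)


def assign_advantages(root: Node, eps: float = 1e-8) -> None:
    # Group-normalize leaf rewards (GRPO-style baseline and scale).
    group = leaves(root)
    rewards = [leaf.reward for leaf in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    for leaf in group:
        leaf.advantage = (leaf.reward - mu) / (sigma + eps)

    # Back up: an internal edge gets the mean of its children's advantages.
    # Uniform weighting is an assumption of this sketch.
    def backup(node: Node) -> float:
        if node.children:
            node.advantage = mean([backup(c) for c in node.children])
        return node.advantage

    backup(root)


# Toy tree: two branch levels, two children each, so four leaves.
a = Node(children=[Node(reward=1.0), Node(reward=3.0)])
b = Node(children=[Node(reward=5.0), Node(reward=7.0)])
root = Node(children=[a, b])
assign_advantages(root)

depth = 2
tree_segments = count_edges(root)                 # shared prefixes computed once
independent_segments = len(leaves(root)) * depth  # every trajectory from scratch
```

In this toy tree the shared-prefix layout computes 6 segments where four independent trajectories would compute 8, and the edge into `a` receives a negative advantage while the edge into `b` receives a positive one, even though every leaf below a given edge shares that prefix. Sibling leaves still keep distinct advantages, which is the sense in which credit becomes step-specific rather than uniform along a trajectory.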

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

(+) I haven't seen the idea of tree-structured advantage estimation before.
(+) I appreciate the ablations performed.

Weaknesses

(-) From what I can tell, the results are not significantly stronger than similar baselines performance-wise. That said, requiring less compute is definitely a plus.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. Recasting diffusion denoising as a search tree with shared prefixes is a creative idea that directly addresses sample efficiency and credit assignment issues. The use of log probability weighted backup for per-edge advantages is theoretically sound.
2. In terms of efficiency gains, TreeGRPO reduces per-iteration training time by ~2× to 3× while matching or surpassing baseline alignment scores. The method shows especially strong improvements in aesthetic scores.
3. The paper provi

Weaknesses

1. While the method amortizes computation, branching multiple trajectories simultaneously increases memory usage, especially for large diffusion models. The paper does not quantify the computational overhead relative to baselines or provide strategies for memory management beyond acknowledging the issue.
2. The performance improvement is somewhat marginal. Although TreeGRPO outperforms baselines on HPS and aesthetics, DanceGRPO achieves the highest ImageReward score in the single reward setting.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper tackles the problem of making RL-based fine-tuning of vision-based generative models more efficient, which is a well-motivated and common problem in the existing literature. Additionally, TreeGRPO presents a significant improvement in runtime against the baselines considered, while matching or improving performance.

Weaknesses

My main concern is the following, with some additional questions/concerns listed below. In Section 3.3, the paper claims that “Deterministic ODE solvers lack the transition probabilities required by policy-gradient RL...” and “we convert the probability-flow ODE... to an equivalent SDE that admits tractable likelihoods...”. Likelihoods necessary for RL are not tractable in SDEs (e.g., due to the Brownian motion)? How is the paper computing log likelihoods for the advantage-based update in Equat
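For context on the per-step likelihood question, the following is a general fact about discretized SDEs and not a statement of what the paper does: a single Euler-Maruyama step with state-independent noise scale is Gaussian given the previous state, so the log density of that one transition has a closed form even though the path is driven by Brownian motion. The Python sketch below computes it; the function name `em_step_logprob`, its arguments, and the isotropic noise assumption are hypothetical choices for illustration.

```python
import math


def em_step_logprob(x_next, x, drift, sigma, dt):
    """Log density of x_next under N(x + drift * dt, sigma^2 * dt * I).

    Assumes one Euler-Maruyama step with isotropic, state-independent noise;
    x_next, x, and drift are equal-length sequences of floats.
    """
    var = sigma * sigma * dt
    logp = 0.0
    for xn, xi, fi in zip(x_next, x, drift):
        mean_i = xi + fi * dt  # deterministic part of the step
        logp += -0.5 * ((xn - mean_i) ** 2 / var + math.log(2 * math.pi * var))
    return logp


# One-dimensional check with zero drift and unit variance.
lp_at_mean = em_step_logprob([0.0], [0.0], [0.0], sigma=1.0, dt=1.0)
lp_off_mean = em_step_logprob([1.0], [0.0], [0.0], sigma=1.0, dt=1.0)
```

With unit variance, moving the next state one standard deviation away from the step mean lowers the log density by exactly 0.5, which is the standard Gaussian behavior a policy-gradient update would differentiate through.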


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning