TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models
Zheng Ding, Weirui Ye

TL;DR
TreeGRPO is a tree-structured reinforcement learning framework for aligning diffusion models with human preferences. By reusing shared denoising prefixes across candidate trajectories, it lowers training cost (a reported 2.4× speedup over baselines) while improving reward optimization.
Contribution
The paper presents TreeGRPO, a tree-based RL method for training generative models that improves sample efficiency, enables fine-grained (per-step) credit assignment, and amortizes computation across trajectories that share a common prefix.
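To make the amortized-computation claim concrete, the following minimal sketch counts denoiser evaluations for a branching tree versus independent trajectories that yield the same number of final samples. The step count, branching schedule, and function names are illustrative assumptions, not values from the paper. The sketch also assumes one denoiser call per active node per step, with branching applied after that call (one model prediction, several noise draws).

```python
def tree_denoiser_calls(num_steps, branch_steps, branching):
    """Count denoiser calls when trajectories share prefixes (hypothetical helper).

    Assumption: each active node needs one denoiser call per step, and at a
    branching step every node spawns `branching` children after that call.
    Returns (total_calls, number_of_leaves).
    """
    active = 1  # a single shared initial noise sample
    calls = 0
    for t in range(num_steps):
        calls += active          # one forward pass per active node
        if t in branch_steps:
            active *= branching  # children share everything up to step t
    return calls, active


def independent_denoiser_calls(num_steps, num_samples):
    """Count denoiser calls when every sample is denoised from scratch."""
    return num_steps * num_samples


# Illustrative settings only: 10 steps, branch 3 ways after steps 2 and 5.
tree_calls, num_leaves = tree_denoiser_calls(10, {2, 5}, 3)
flat_calls = independent_denoiser_calls(10, num_leaves)
print(tree_calls, num_leaves, flat_calls)
```

With these hypothetical settings the tree produces 9 final samples using 48 denoiser calls, where independent sampling would need 90. This counts sampling cost only and is not a reproduction of the paper's measured speedup.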
Findings
Achieves 2.4× faster training compared to baselines
Outperforms GRPO across multiple benchmarks and reward models
Provides a scalable approach for RL-based generative model alignment
Abstract
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods; and (3) amortized computation where…
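The reward backpropagation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the backup rule here (a softmax over edge log-probabilities as weights, and an edge advantage defined as child value minus parent value) is an assumption, as are the `Node` class and all function names.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    log_prob: float = 0.0            # log-probability of the edge into this node
    reward: Optional[float] = None   # set on leaves only (final-sample reward)
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0
    advantage: float = 0.0           # advantage of the edge into this node


def collect_leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in collect_leaves(child)]


def backup(node):
    """Propagate values from leaves to root and set per-edge advantages."""
    if not node.children:
        return node.value
    child_values = [backup(child) for child in node.children]
    # Assumed weighting: softmax over the children's edge log-probabilities.
    top = max(child.log_prob for child in node.children)
    weights = [math.exp(child.log_prob - top) for child in node.children]
    total = sum(weights)
    node.value = sum(w * v for w, v in zip(weights, child_values)) / total
    for child, v in zip(node.children, child_values):
        child.advantage = v - node.value  # step-specific credit
    return node.value


def tree_advantages(root, eps=1e-8):
    """Normalize leaf rewards across the group (GRPO-style), then back up."""
    leaves = collect_leaves(root)
    rewards = [leaf.reward for leaf in leaves]
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    for leaf in leaves:
        leaf.value = (leaf.reward - mean) / (std + eps)
    backup(root)
    return root


# Toy tree: two branches, two leaves each, equal edge probabilities.
lp = math.log(0.5)
branch_a = Node(lp, children=[Node(lp, reward=1.0), Node(lp, reward=3.0)])
branch_b = Node(lp, children=[Node(lp, reward=5.0), Node(lp, reward=7.0)])
root = tree_advantages(Node(children=[branch_a, branch_b]))
print(branch_a.advantage, branch_b.advantage)
```

In this toy tree the edge into the higher-reward branch gets a positive advantage and the other a negative one, while edges deeper in the tree are credited only for how they compare with their siblings. A trajectory-level method would instead assign one advantage to every step of a trajectory.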
Peer Reviews
Decision: ICLR 2026 Poster
(+) I haven't seen the idea of tree-structured advantage estimation before. (+) I appreciate the ablations performed.
(-) From what I can tell, the results are not significantly stronger than similar baselines performance-wise. That said, requiring less compute is definitely a plus.
1. Recasting diffusion denoising as a search tree with shared prefixes is a creative idea that directly addresses sample efficiency and credit assignment issues. The use of log-probability-weighted backup for per-edge advantages is theoretically sound. 2. In terms of efficiency gains, TreeGRPO reduces per-iteration training time by roughly 2× to 3× while matching or surpassing baseline alignment scores. The method shows especially strong improvements in aesthetic scores. 3. The paper provi
1. While the method amortizes computation, branching multiple trajectories simultaneously increases memory usage, especially for large diffusion models. The paper does not quantify the computational overhead relative to baselines or provide strategies for memory management beyond acknowledging the issue. 2. The performance improvement is somewhat marginal. Although TreeGRPO outperforms baselines on HPS and aesthetics, DanceGRPO achieves the highest ImageReward score in the single reward setting.
The paper tackles the problem of making RL-based fine-tuning of vision-based generative models more efficient, which is a well-motivated and common problem in the existing literature. Additionally, TreeGRPO presents a significant improvement in runtime against the baselines considered, while matching or improving performance.
My main concern is the following, with some additional questions/concerns listed below. In Section 3.3, the paper claims that “Deterministic ODE solvers lack the transition probabilities required by policy-gradient RL...” and “we convert the probability-flow ODE... to an equivalent SDE that admits tractable likelihoods...”. Likelihoods necessary for RL are not tractable in SDEs (e.g., due to the Brownian motion)? How is the paper computing log likelihoods for the advantage-based update in Equat
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning
