TreeRPO: Tree Relative Policy Optimization
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang

TL;DR
TreeRPO introduces a tree sampling-based reward estimation method for reinforcement learning in LLMs, providing fine-grained feedback that improves reasoning accuracy and efficiency in mathematical problem-solving tasks.
Contribution
It proposes TreeRPO, a novel reward estimation technique that directly computes step-level rewards via tree sampling, enhancing learning signals for LLM reasoning.
Findings
Pass@1 accuracy of Qwen-2.5-Math increased from 19.0% to 35.5%.
Outperformed GRPO by 2.9% in accuracy.
Reduced average response length by 18.1%.
Abstract
Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly…
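The idea of estimating step-level rewards from a sampled tree can be illustrated with a minimal sketch. It is not the authors' implementation: the `StepNode` structure, the mean-of-children backup rule, and the sibling-group normalization (subtract the group mean, divide by the group standard deviation, in the style of GRPO) are assumptions chosen for illustration. Leaf rewards stand in for a verifier's 0/1 outcome on a completed answer.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List


@dataclass
class StepNode:
    """One reasoning step in a sampled tree (hypothetical structure)."""
    children: List["StepNode"] = field(default_factory=list)
    leaf_reward: float = 0.0   # verifier outcome; only meaningful for leaves
    value: float = 0.0         # estimated expected reward of this step
    advantage: float = 0.0     # reward relative to sibling steps


def backup(node: StepNode) -> float:
    """Estimate a step's expected reward as the mean over its sampled children."""
    if not node.children:
        node.value = float(node.leaf_reward)
    else:
        node.value = mean([backup(child) for child in node.children])
    return node.value


def assign_group_advantages(node: StepNode, eps: float = 1e-6) -> None:
    """Normalize each child's value within its sibling group (assumed rule)."""
    group = node.children
    if not group:
        return
    values = [child.value for child in group]
    mu, sigma = mean(values), pstdev(values)
    for child in group:
        child.advantage = (child.value - mu) / (sigma + eps)
        assign_group_advantages(child, eps)


# Toy tree: step A leads to two correct answers, step B to one correct and one wrong.
step_a = StepNode(children=[StepNode(leaf_reward=1.0), StepNode(leaf_reward=1.0)])
step_b = StepNode(children=[StepNode(leaf_reward=1.0), StepNode(leaf_reward=0.0)])
root = StepNode(children=[step_a, step_b])

backup(root)
assign_group_advantages(root)
print(step_a.value, step_b.value)          # 1.0 0.5
print(step_a.advantage > step_b.advantage)  # True
```

In this toy tree, step A receives a positive advantage and step B a negative one even though both eventually reach a correct answer on some branch, which is the kind of intermediate signal a single trajectory-level reward cannot provide. Sibling groups whose values are all identical get zero advantage under this rule.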
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Originality: First reward-model-free RL approach that delivers per-step (process-style) supervision via tree sampling rather than learned PRMs or Monte-Carlo roll-outs.
- Empirical gains: Consistent +2–3% absolute accuracy improvements over a strong GRPO baseline on four maths benchmarks, with lower token cost.
- Clarity of method: The recursive reward back-propagation and group-level filtering are easy to implement and clearly described.
- Well-scoped: Stays within verifiable-reward domains (ma
- Missing baselines: No comparison with any PRM-based method (e.g., Math-Shepherd, Wang et al. 2024) or step-level RL (Step-DPO, Lai et al. 2024). Hence the claim “first to provide dense signals without a reward model” is not sufficient to establish superiority over existing PRM pipelines.
- Statistical rigor: Results are reported as single-run curves (Fig. 3) without standard deviations or confidence intervals. With only 500–1k test questions, variance can be high.
- Ablations incomplete: Tree d
The strengths of the work are as follows.
- First, the paper touches on an important subject in which GRPO is a critical part of training today's LLMs.
- Second, the technical approach is generally sound, and the sampling approach is well-motivated.
- Finally, the paper is empirically strong and shows a large performance improvement on Qwen-2.5-Math-1.5B and outperforms GRPO.
The weaknesses are as follows.
- First, the sampling approach is quite computationally expensive. Although the approach is model-free, it still incurs a large sampling cost.
- Second, the ablation studies could be improved, e.g. with additional analysis of the branching factor or the depth of the tree.
- Finally, the model sizes tested (1.5B, 7B) are a bit small, and it would be interesting to see how the method performs on larger model families. Additionally, the performance gain is not as great with t
1. Reward-model-free step-level reward estimation: TreeRPO addresses a crucial limitation of trajectory-level RL, namely the lack of dense rewards.
2. Efficiency and performance gains: experimental results demonstrate that TreeRPO boosts Pass@1 accuracy by 2.9% over GRPO on multiple mathematics benchmarks and provides improved computational efficiency.
1. No theoretical grounding is provided for the proposed method: does it yield more effective estimates for the loss? Can we compare the variance of these estimates with those of GRPO? There should at least be some intuition, examples, formulas, or simulation studies. Since GRPO can be seen algorithmically as a special case of TreeRPO (when the branching factor is 1), why does a greater branching factor lead to better results? To me, there is no clear intuition supporting this.
2. In
Taxonomy
Topics: Climate Change Policy and Economics · Electric Power System Optimization · Game Theory and Voting Systems
