TreeRPO: Tree Relative Policy Optimization

Zhicheng Yang; Zhijiang Guo; Yinya Huang; Xiaodan Liang; Yiwei Wang; Jing Tang

arXiv:2506.05183·cs.LG·September 30, 2025

TL;DR

TreeRPO introduces a tree sampling-based reward estimation method for reinforcement learning in LLMs, providing fine-grained feedback that improves reasoning accuracy and efficiency in mathematical problem-solving tasks.

Contribution

It proposes TreeRPO, a novel reward estimation technique that directly computes step-level rewards via tree sampling, enhancing learning signals for LLM reasoning.

Findings

1. Pass@1 accuracy of Qwen-2.5-Math increased from 19.0% to 35.5%.
2. Outperformed GRPO by 2.9% in accuracy.
3. Reduced average response length by 18.1%.

Abstract

Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly…
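The idea in the abstract, estimating a step's expected reward from the sampled subtree below it and then comparing steps within a sibling group, can be illustrated with a minimal sketch. This is not the authors' implementation. The `Node` class, the function names, the mean-of-children backup, and the GRPO-style mean/std normalization over siblings are all assumptions made for illustration; the paper may define the step reward and advantage differently.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step (a text segment) in the sampled tree (hypothetical type)."""
    text: str
    children: List["Node"] = field(default_factory=list)
    leaf_reward: Optional[float] = None  # verifier outcome, set on leaves only
    value: float = 0.0       # estimated expected reward of this step
    advantage: float = 0.0   # group-relative signal among siblings


def estimate_values(node: Node) -> float:
    """Back rewards up the tree: a leaf keeps its verifier reward and an
    inner node takes the mean of its children's values (assumed rule)."""
    if not node.children:
        if node.leaf_reward is None:
            raise ValueError("leaf without a verifier reward")
        node.value = node.leaf_reward
    else:
        node.value = mean([estimate_values(c) for c in node.children])
    return node.value


def assign_group_advantages(node: Node, eps: float = 1e-6) -> None:
    """Normalize each child's value against its sibling group
    (GRPO-style mean/std normalization, assumed here)."""
    if not node.children:
        return
    values = [c.value for c in node.children]
    mu, sigma = mean(values), pstdev(values)
    for child in node.children:
        child.advantage = (child.value - mu) / (sigma + eps)
        assign_group_advantages(child, eps)


def build_example_tree() -> Node:
    """Toy tree with branching factor 2 and depth 2; rewards are made up."""
    step_a = Node("step A", children=[
        Node("A -> answer 1", leaf_reward=1.0),
        Node("A -> answer 2", leaf_reward=1.0),
    ])
    step_b = Node("step B", children=[
        Node("B -> answer 1", leaf_reward=1.0),
        Node("B -> answer 2", leaf_reward=0.0),
    ])
    return Node("question", children=[step_a, step_b])


if __name__ == "__main__":
    root = build_example_tree()
    estimate_values(root)
    assign_group_advantages(root)
    for step in root.children:
        print(step.text, step.value, round(step.advantage, 3))
```

In this toy tree, step A has estimated value 1.0 and step B has 0.5, so A receives a positive advantage and B a negative one, even though both steps lead to at least one correct answer. Under these assumptions, a tree of depth 1 reduces to a single sibling group of complete responses, which is the trajectory-level grouping that GRPO uses.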

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Originality: First reward-model-free RL approach that delivers per-step (process-style) supervision via tree sampling rather than learned PRMs or Monte-Carlo roll-outs
- Empirical gains: Consistent +2–3% absolute accuracy improvements over a strong GRPO baseline on four maths benchmarks, with lower token cost
- Clarity of method: The recursive reward back-propagation and group-level filtering are easy to implement and clearly described.
- Well-scoped: Stays within verifiable-reward domains (ma

Weaknesses

- Missing baselines: No comparison with any PRM-based method (e.g., Math-Shepherd, Wang et al. 2024) or step-level RL (Step-DPO, Lai et al. 2024). Hence the claim “first to provide dense signals without a reward model” is not sufficient to establish superiority over existing PRM pipelines.
- Statistical rigor: Results are reported as single-run curves (Fig. 3) without standard deviations or confidence intervals. With only 500–1k test questions, variance can be high.
- Ablations incomplete: Tree d

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The strengths of the work are as follows.
- First, the paper touches on an important subject in which GRPO is a critical part of training today's LLMs.
- Second, the technical approach is generally sound, and the sampling approach is well-motivated.
- Finally, the paper is empirically strong and shows a large performance improvement on Qwen-2.5-Math-1.5B and outperforms GRPO.

Weaknesses

The weaknesses are as follows.
- First, the sampling approach is quite computationally expensive. Although the approach is model-free, it still incurs a large sampling cost.
- Second, the ablation studies could be improved, e.g. with additional analysis of the branching factor or the depth of the tree.
- Finally, the model sizes tested (1.5B, 7B) are a bit small, and I am curious how the method would do on larger model families. Additionally, the performance gain is not as great with t

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. Reward-model-free step-level reward estimation: TreeRPO addresses a crucial limitation of trajectory-level RL, the lack of dense rewards.
2. Efficiency and performance gains: experimental results demonstrate that TreeRPO significantly boosts Pass@1 accuracy by 2.9% over GRPO on multiple mathematics benchmarks and provides improved computational efficiency

Weaknesses

1. There is no theoretical grounding provided for the proposed method: does it yield more effective estimates for the loss? Can we compare the variance of these estimates with those of GRPO? There should at least be some intuition, examples, formulas, or simulation studies. Since GRPO can be seen algorithmically as a special case of TreeRPO (when the branching factor is 1), why does a greater branching factor lead to better results? To me, there is no clear intuition supporting this.
2. In

Taxonomy

Topics: Climate Change Policy and Economics · Electric Power System Optimization · Game Theory and Voting Systems