TreeAdv: Tree-Structured Advantage Redistribution for Group-Based RL
Lang Cao, Hui Ruan, Yongqian Li, Peng Chao, Wu Ning, Haonan Song, Renhong Chen, Yitong Li

TL;DR
TreeAdv introduces a tree-structured advantage redistribution method for group-based reinforcement learning, improving sample efficiency and logical depth in language model reasoning tasks.
Contribution
It explicitly models the tree structure of rollouts, redistributes advantages across internal segments, and outperforms existing methods on math reasoning benchmarks.
Findings
Outperforms GRPO and GSPO on 10 math reasoning benchmarks.
Uses fewer tokens while maintaining or improving performance.
Effectively redistributes advantages across tree segments.
Abstract
Reinforcement learning with group-based objectives, such as Group Relative Policy Optimization (GRPO), is a common framework for aligning large language models on complex reasoning tasks. However, standard GRPO treats each rollout trajectory as an independent flat sequence and assigns a single sequence-level advantage to all tokens, which leads to sample inefficiency and a length bias toward verbose, redundant chains of thought without improving logical depth. We introduce TreeAdv (Tree-Structured Advantage Redistribution for Group-Based RL), which makes the tree structure of group rollouts explicit for both exploration and advantage assignment. Specifically, TreeAdv builds a group of trees (a forest) based on an entropy-driven sampling method where each tree branches at high-uncertainty decisions while sharing low-uncertainty tokens across rollouts. Then, TreeAdv aggregates token-level…
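To make the contrast in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. `grpo_advantages` and `flat_token_advantages` show the standard GRPO behavior the abstract describes: one group-normalized advantage copied to every token of a rollout. `segment_advantages` is a hypothetical illustration of tree-aware credit: it assumes each rollout is a path of named segments and that a shared segment receives the mean advantage of the rollouts passing through it. The abstract is truncated before it states TreeAdv's actual aggregation rule, so that averaging rule, the function names, the segment labels, and the use of population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized, sequence-level advantages (standard GRPO style).

    Population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def flat_token_advantages(rewards, lengths):
    """Flat baseline: every token of rollout i gets that rollout's advantage."""
    advs = grpo_advantages(rewards)
    return [[a] * n for a, n in zip(advs, lengths)]


def segment_advantages(rollout_segments, rewards):
    """Hypothetical tree-aware variant (NOT confirmed by the source).

    Each rollout is a root-to-leaf list of segment ids. A segment's advantage
    is the mean advantage of all rollouts that pass through it.
    """
    advs = grpo_advantages(rewards)
    through = {}
    for segments, adv in zip(rollout_segments, advs):
        for seg in segments:
            through.setdefault(seg, []).append(adv)
    return {seg: mean(vals) for seg, vals in through.items()}


# Four rollouts sharing a root; two branches ("a", "b"), each with two leaves.
paths = [
    ["root", "a", "a1"],
    ["root", "a", "a2"],
    ["root", "b", "b1"],
    ["root", "b", "b2"],
]
rewards = [1.0, 1.0, 0.0, 0.0]

flat = flat_token_advantages(rewards, lengths=[3, 3, 3, 3])
seg = segment_advantages(paths, rewards)
```

In this toy group, the flat baseline gives the shared `root` tokens a positive advantage in two rollouts and a negative one in the other two, while the assumed segment-level rule gives `root` an advantage near zero and concentrates the signal on the branches `a` and `b`, where the outcomes actually diverge.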
