TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Yizhi Li; Qingshui Gu; Zhoufutu Wen; Ziniu Li; Tianshun Xing; Shuyue Guo; Tianyu Zheng; Xin Zhou; Xingwei Qu; Wangchunshu Zhou; Zheng Zhang; Wei Shen; Qian Liu; Chenghua Lin; Jian Yang; Ge Zhang; Wenhao Huang

arXiv:2508.17445 · cs.LG · August 26, 2025



TL;DR

TreePO introduces a tree-structured rollout algorithm for reinforcement learning in language models, significantly improving exploration diversity and computational efficiency during training and inference.

Contribution

TreePO presents a segment-wise sampling algorithm and a tree-based advantage estimation method that together reduce compute costs while maintaining or improving model performance.

Findings

01. Achieves up to 43% GPU-hour savings in sampling.

02. Demonstrates up to a 40% reduction in trajectory-level sampling compute.

03. Improves reasoning benchmark performance with enhanced exploration.

Abstract

Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with…
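The prefix-sharing idea in the abstract can be illustrated with a toy Python sketch. It is not the authors' implementation: `Node`, `grow`, `tree_tokens`, and `trajectories` are hypothetical names, random integers stand in for model-decoded tokens, and the branching factor is fixed rather than driven by local uncertainty. The sketch only compares how many tokens are decoded when trajectories share segments in a tree versus when each trajectory is decoded independently.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    tokens: list                                   # one fixed-length segment
    children: list = field(default_factory=list)   # continuations of this segment


def grow(rng, depth, max_depth, seg_len, branch):
    """Decode one segment, then branch until max_depth segments are stacked."""
    # Assumption: random integers stand in for tokens sampled from a policy model.
    node = Node([rng.randrange(100) for _ in range(seg_len)])
    if depth < max_depth:
        node.children = [
            grow(rng, depth + 1, max_depth, seg_len, branch)
            for _ in range(branch)
        ]
    return node


def tree_tokens(node):
    """Tokens decoded when every shared segment is generated only once."""
    return len(node.tokens) + sum(tree_tokens(c) for c in node.children)


def trajectories(node, prefix=()):
    """Flatten the tree into full root-to-leaf token sequences."""
    path = prefix + tuple(node.tokens)
    if not node.children:
        return [path]
    out = []
    for child in node.children:
        out.extend(trajectories(child, path))
    return out


root = grow(random.Random(0), depth=1, max_depth=3, seg_len=4, branch=2)
paths = trajectories(root)
shared = tree_tokens(root)                 # cost with prefix sharing
independent = sum(len(p) for p in paths)   # cost if each path were decoded alone
print(len(paths), shared, independent)
```

With three segments of four tokens and two branches per node, the tree yields four trajectories. The gap between `shared` and `independent` reflects only this toy configuration and says nothing about the savings reported for TreePO itself.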


Code & Models

Models

Datasets

m-a-p/TreePO_data
dataset· 133 dl
