Truncated Proximal Policy Optimization

Tiantian Fan; Lingjun Liu; Yu Yue; Jiaze Chen; Chengyi Wang; Qiying Yu; Chi Zhang; Zhiqi Lin; Ruofei Zhu; Yufeng Yuan; Xiaochen Zuo; Bole Ma; Mofan Zhang; Gaohong Liu; Ru Zhang; Haotian Zhou; Cong Xie; Ruidong Zhu; Zhi Zhang; Xin Liu; Mingxuan Wang; Lin Yan; Yonghui Wu

arXiv:2506.15050·cs.AI·June 19, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces T-PPO, an extension of PPO for training large language models. It cuts training time while maintaining performance by learning from incomplete (truncated) responses and streamlining policy updates.

Contribution

The paper proposes T-PPO, which combines EGAE, an advantage estimator that works on partial responses, with a mechanism that optimizes the policy and value models independently. Together these improve training efficiency.
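To make the idea of advantage estimation from partial responses concrete, here is a minimal sketch. It is not the paper's implementation: this page does not give the EGAE formula, so the sketch applies standard GAE and relies on the assumption quoted in the reviews below, V(s_l) = V(s_{l-1}), to bootstrap a response cut off at length l. The function name `truncated_gae`, its parameters, and the default `gamma` and `lam` values are hypothetical.

```python
from typing import List


def truncated_gae(
    rewards: List[float],
    values: List[float],
    finished: bool,
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """GAE over a response that may have been cut off before it ended.

    rewards[t] and values[t] cover the l tokens generated so far.
    Assumption (illustrative): if the response is finished, the value after
    the last token is 0; if it was truncated, V(s_l) is taken to equal
    V(s_{l-1}), i.e. the last available value estimate is reused.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")

    n = len(rewards)
    advantages = [0.0] * n
    if n == 0:
        return advantages

    # Bootstrap value for the state that follows the last generated token.
    next_value = 0.0 if finished else values[-1]
    running = 0.0

    # Standard backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]

    return advantages
```

With `finished=False` and zero intermediate rewards, the last token's TD error vanishes under this assumption, so earlier tokens receive a learning signal only from changes in the value estimates along the partial response.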

Findings

01. T-PPO improves training efficiency by up to 2.5x.

02. T-PPO outperforms existing methods on reasoning tasks.

03. The approach maintains convergence performance despite efficiency improvements.

Abstract

Recently, test-time scaling Large Language Models (LLMs) have demonstrated exceptional reasoning capabilities across scientific and professional tasks by generating long chains-of-thought (CoT). As a crucial component for developing these reasoning models, reinforcement learning (RL), exemplified by Proximal Policy Optimization (PPO) and its variants, allows models to learn through trial and error. However, PPO can be time-consuming due to its inherent on-policy nature, which is further exacerbated by increasing response lengths. In this work, we propose Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy update and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

The target problem is important, as partial rollouts naturally arise in large-scale RL for LLMs. The presentation is clear, and the reported experimental results show promising improvements.

Weaknesses

The method is incremental. The proposed EGAE differs little from standard GAE, and the overall algorithm resembles prior partial-rollout implementations such as those used in Kimi. The paper does not clearly justify what is new or why the proposed adjustment yields improvement. In addition, the analysis is limited. There is no theoretical or empirical discussion of bias introduced by truncation, nor comparison with other off-policy corrections such as partial rollout with importance sampling, w

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This is a method that the research community needs. Because Chain-of-Thought models produce very long token sequences, training them is practically infeasible for most researchers outside major companies. Although the proposed method is limited to PPO, a technique that accelerates training for long output sequences has significant practical value. The method description is clear and easy to follow. The experiments convincingly support the authors’ claims.

Weaknesses

Limited experimental comparisons. The paper reports experiments on only one dataset. The core assumption of T-PPO, V(s_l) = V(s_{l-1}), is a strong approximation whose theoretical error bound is not proven. Therefore, the authors should conduct experiments on multiple datasets to demonstrate that the approximation holds in general. In particular, it is unclear whether T-PPO performs well on shorter datasets where the value function changes rapidly, or on non-mathematical domains. Based on the pre

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper is clearly written and well organized, making the method easy to follow. 2. The proposed approach appears computationally efficient for the chosen model and dataset.

Weaknesses

1. The paper lacks a Related Work section. The authors should review recent advances in related areas. 2. The value function estimates the expected return under the current policy; the two should be tightly coupled. The current design partially decouples value training from policy updates. 3. The experiments are too limited: only one model and one dataset are used, seemingly with a single training run. There is no robustness analysis across seeds, datasets, or models. 4. Ablation studies on key

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Risk and Portfolio Optimization