Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training

Xue Gong; Qi Yi; Ziyuan Nan; Guanhua Huang; Kejiao Li; Yuhao Jiang; Ruibin Xiong; Zenan Xu; Jiaming Guo; Shaohui Peng; Bo Zhou

arXiv:2601.07320·cs.LG·January 13, 2026

TL;DR

This paper introduces Segmental Advantage Estimation (SAE), a novel method to improve advantage estimation in PPO for long-context LLM training, leading to more stable and efficient reinforcement learning with sparse rewards.

Contribution

SAE partitions sequences into segments to compute advantage estimates, reducing bias and noise compared to GAE, and enhances PPO performance in long-context LLM training.

Findings

01

SAE outperforms GAE in final scores and stability.

02

SAE improves sample efficiency across multiple model sizes.

03

Higher correlation with ground-truth advantage confirms SAE's accuracy.

Abstract

Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating $n$-step advantages at every token (as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE…
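For reference, the per-token aggregation that the abstract attributes to GAE can be sketched as below. This is the standard GAE recursion, not the paper's SAE, whose formulation is truncated in the abstract above. The function name, argument layout, and default values are illustrative assumptions.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> List[float]:
    """Standard per-token GAE (illustrative sketch, not the paper's SAE).

    rewards: one reward per token; in a sparse-reward setting only the
             last entry is typically non-zero.
    values:  value predictions with len(rewards) + 1 entries; the last
             entry is the bootstrap value (0.0 for a finished response).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each token reuses the accumulated sum of later tokens.
    for t in reversed(range(len(rewards))):
        # One-step TD error at token t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors, updated at every token.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=1.0` every token's advantage reduces to the observed return minus the value prediction at that token. With `lam=0.0` it reduces to the one-step TD error, which relies entirely on the intermediate value predictions.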

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

(1) The proposed method in this paper is practically elegant, as its recursive formulation allows for seamless integration into existing PPO frameworks with minimal computational overhead. (2) The empirical evaluation is thorough, benchmarking against strong baselines like GRPO and adaptive PPO variants across multiple out-of-distribution test sets (AIME, AMC). The consistent performance gains across 4B, 8B, and 14B model sizes strongly support the method's robustness and scalability.

Weaknesses

(1) While the use of low-probability tokens is intuitive, it is an unsupervised method that may not always align perfectly with true semantic boundaries, potentially introducing its own form of noise. (2) The evaluation is confined to mathematical reasoning. While this is a canonical domain for RLVR, the paper does not demonstrate SAE's efficacy in other long-context scenarios like code generation or complex dialogue, limiting the claimed generality of the approach.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The paper is well written and easy to follow.
- Accurate value estimation is crucial for PPO algorithms. The idea of segmenting responses based on low-probability tokens is intuitive and makes sense.
- Experiments are conducted to demonstrate the effectiveness of the proposed method compared to standard PPO.

Weaknesses

My main concern is the lack of comparison with related baselines:
- There is no comparison with the related works mentioned in the paper, such as VC-PPO and VAPO.
- Previous studies have proposed computing GAE at the step level (e.g., by splitting sequences using special tokens such as ‘\n’) [1]. This paper is closely related to those approaches, and a comparison with them would help better demonstrate the effectiveness of the proposed method.

[1] Chen, Guoxin, et al. "Alphamath almost zero: process supervis

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- Well-motivated problem: the instability of GAE in RLVR, where $\lambda$ is set to 1.
- Insightful solution that focuses on segments of the response for advantage estimation instead of individual tokens.
- Theoretical analysis justifying that SAE reduces the bias in estimation.
- Empirical analysis showing that SAE has the highest correlation with the true advantage compared to other baselines in a controlled setting.

Weaknesses

- The SAE method uses a fixed threshold of 0.2 on the probability to decide the segments. I would have preferred an ablation study for the choice of this parameter.
- I would prefer to have SAE compared with the simple baseline of fixed-length segments from the theoretical analysis of Section 4.2. For example, what is the effect when I naively let chunks be of size $M=100$ or $200$ tokens irrespective of the probability? Does the choice of segmentation method matter towards the downstream perfo
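The two segmentation choices this review contrasts can be sketched as follows. This is a minimal illustration, not the paper's implementation. The 0.2 threshold is taken from the review, while the function names, the half-open `(start, end)` representation, and the convention that a low-probability token opens a new segment are assumptions made here.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open token index range [start, end)


def segment_by_low_prob(token_probs: List[float], threshold: float = 0.2) -> List[Segment]:
    """Split a response wherever a sampled token's probability is below threshold.

    Assumption: the low-probability token starts the next segment.
    """
    segments: List[Segment] = []
    start = 0
    for i, p in enumerate(token_probs):
        # The first token cannot close a previous segment, so skip i == 0.
        if i > 0 and p < threshold:
            segments.append((start, i))
            start = i
    if token_probs:
        segments.append((start, len(token_probs)))
    return segments


def segment_fixed_length(n_tokens: int, size: int) -> List[Segment]:
    """Baseline: chunks of `size` tokens, ignoring token probabilities."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(s, min(s + size, n_tokens)) for s in range(0, n_tokens, size)]
```

For example, `segment_by_low_prob([0.9, 0.8, 0.1, 0.95, 0.15, 0.7])` yields three segments that begin at the two low-probability tokens, whereas `segment_fixed_length(6, 2)` yields three equal chunks regardless of the probabilities.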

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Explainable Artificial Intelligence (XAI)