Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, Bo Zhou

TL;DR
This paper introduces Segmental Advantage Estimation (SAE), a novel method to improve advantage estimation in PPO for long-context LLM training, leading to more stable and efficient reinforcement learning with sparse rewards.
Contribution
SAE partitions each sequence into segments and computes advantage estimates over those segments rather than at every token, which reduces bias and noise relative to GAE and improves PPO performance in long-context LLM training.
Findings
SAE outperforms GAE in final scores and stability.
SAE improves sample efficiency across multiple model sizes.
SAE's estimates correlate more strongly with the ground-truth advantage than those of the baselines, supporting its accuracy.
Abstract
Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating $n$-step advantages at every token (as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE…
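To make the abstract's argument concrete, here is a minimal Python sketch, not the authors' implementation. `gae` is the standard token-level GAE recursion. `segment_gated_gae` is a hypothetical variant that assumes the λ decay is applied only where a new segment begins; the abstract is truncated before it defines SAE, so this gating rule, the function names, and the toy rewards and values are all assumptions made for illustration.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Standard token-level GAE.

    `values` has one more entry than `rewards`; the last entry is the
    bootstrap value after the final token (0.0 for a finished response).
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at token t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted aggregation, decayed at every token.
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def segment_gated_gae(rewards, values, boundary, gamma=1.0, lam=0.95):
    """Hypothetical segment-level variant (an assumption, not the paper's definition).

    boundary[t] is True when token t opens a new segment. The decay `lam`
    is applied only when crossing into a new segment; inside a segment
    the TD errors are summed without decay.
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        crosses_boundary = t + 1 < T and boundary[t + 1]
        lam_t = lam if crosses_boundary else 1.0
        running = delta + gamma * lam_t * running
        adv[t] = running
    return adv


# Toy sparse-reward episode: a single verifiable reward on the last token,
# with made-up (noisy) intermediate value predictions.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.5, 0.1, 0.7, 0.4, 0.6, 0.0]
boundary = [True, False, False, True, False, False]

print(gae(rewards, values, lam=0.9))
print(segment_gated_gae(rewards, values, boundary, lam=0.9))
```

In this sketch, a response with no interior boundaries reduces to the λ = 1 estimate (final reward minus the value prediction at each token), while a boundary at every token recovers ordinary GAE, so the segmentation controls how often the value predictions enter the estimate with decay.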
Peer Reviews
Decision · Submitted to ICLR 2026
(1) The proposed method in this paper is practically elegant, as its recursive formulation allows for seamless integration into existing PPO frameworks with minimal computational overhead. (2) The empirical evaluation is thorough, benchmarking against strong baselines like GRPO and adaptive PPO variants across multiple out-of-distribution test sets (AIME, AMC). The consistent performance gains across 4B, 8B, and 14B model sizes strongly support the method's robustness and scalability.
(1) While the use of low-probability tokens is intuitive, it is an unsupervised method that may not always align perfectly with true semantic boundaries, potentially introducing its own form of noise. (2) The evaluation is confined to mathematical reasoning. While this is a canonical domain for RLVR, the paper does not demonstrate SAE's efficacy in other long-context scenarios like code generation or complex dialogue, limiting the claimed generality of the approach.
- The paper is well written and easy to follow.
- Accurate value estimation is crucial for PPO algorithms. The idea of segmenting responses based on low-probability tokens is intuitive and makes sense.
- Experiments are conducted to demonstrate the effectiveness of the proposed method compared to standard PPO.
My main concern is the lack of comparison with related baselines:
- There is no comparison with the related works mentioned, such as VC-PPO and VAPO.
- Previous studies have proposed computing GAE at the step level (e.g., by splitting sequences using special tokens such as ‘\n’) [1]. This paper is closely related to those approaches, and a comparison with them would help better demonstrate the effectiveness of the proposed method.
[1] Chen, Guoxin, et al. "Alphamath almost zero: process supervis
- Well-motivated problem: the instability of GAE estimation in RLVR, where $\lambda$ is set to 1.
- Insightful solution: focus on segments of the response for GAE estimation instead of individual tokens.
- Theoretical analysis justifying that SAE reduces the bias in estimation.
- Empirical analysis showing that SAE has the highest correlation with the true advantage compared to other baselines in a controlled setting.
- The SAE method uses a fixed threshold of 0.2 on the probability to decide the segments. I would have preferred an ablation study for the choice of this parameter.
- I would prefer to have SAE compared with the simple baseline of fixed-length segments from the theoretical analysis of Section 4.2. For example, what is the effect when I naively let chunks be of size $M=100$ or $200$ tokens irrespective of the probability? Does the choice of segmentation method matter towards the downstream perfo
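The two segmentation schemes this review contrasts can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the 0.2 threshold and the fixed chunk size come from the review text, and treating a low-probability token as the first token of a new segment (rather than the last token of the previous one) is an assumption.

```python
def boundaries_by_probability(token_probs, threshold=0.2):
    """Flag token i as opening a new segment when the policy sampled it
    with probability below `threshold`. Token 0 always opens a segment.
    (Placing the low-probability token at the segment start is an assumption.)"""
    return [i == 0 or p < threshold for i, p in enumerate(token_probs)]


def boundaries_fixed_length(num_tokens, chunk_size=100):
    """Baseline: open a new segment every `chunk_size` tokens,
    irrespective of token probability."""
    return [i % chunk_size == 0 for i in range(num_tokens)]


def to_segments(boundary):
    """Convert boundary flags into half-open (start, end) index spans."""
    starts = [i for i, is_start in enumerate(boundary) if is_start]
    ends = starts[1:] + [len(boundary)]
    return list(zip(starts, ends))


# Made-up per-token sampling probabilities for a 7-token response.
example_probs = [0.9, 0.8, 0.1, 0.95, 0.6, 0.15, 0.99]

print(to_segments(boundaries_by_probability(example_probs)))
# [(0, 2), (2, 5), (5, 7)]
print(to_segments(boundaries_fixed_length(len(example_probs), chunk_size=3)))
# [(0, 3), (3, 6), (6, 7)]
```

Either boundary list could be fed to the same downstream advantage computation, which is what would make the ablation the reviewer asks for (threshold value, or probability-based versus fixed-length segments) a one-line change in a setup like this.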
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Explainable Artificial Intelligence (XAI)
