Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu

TL;DR
This paper introduces Segment Policy Optimization (SPO), a reinforcement learning framework that improves large language model reasoning by using segment-level advantage estimation, balancing granularity and estimation accuracy without requiring a critic model.
Contribution
SPO is a novel RL method that employs segment-level advantage estimation with new strategies for partitioning and advantage calculation, enhancing reasoning accuracy in language models.
Findings
SPO outperforms PPO and GRPO on GSM8K by 6-12 percentage points in accuracy.
SPO significantly reduces Monte Carlo estimation costs for long chain-of-thought reasoning.
SPO improves accuracy over GRPO on MATH500 by 7-11 percentage points.
Abstract
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than…
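To make the intermediate granularity concrete, the following is a minimal sketch of segment-level advantage estimation without a critic. It is not the authors' implementation. It assumes that the value of a prefix is estimated by averaging the final reward over a few Monte Carlo rollouts, and that a segment's advantage is the change in that estimate across the segment. The names `mc_value`, `segment_advantages`, `rollout_reward`, and `toy_reward` are hypothetical, and the toy reward stands in for sampling completions from a policy and scoring them.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical signature: given a prefix and an RNG, sample one completion
# and return its final (trajectory-level) reward.
RolloutReward = Callable[[Sequence[str], random.Random], float]


def mc_value(prefix: Sequence[str], rollout_reward: RolloutReward,
             n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of V(prefix): mean final reward over rollouts."""
    return sum(rollout_reward(prefix, rng) for _ in range(n_samples)) / n_samples


def segment_advantages(tokens: Sequence[str], boundaries: Sequence[int],
                       rollout_reward: RolloutReward,
                       n_samples: int = 8, seed: int = 0) -> List[float]:
    """Assumed form: A_k = V(prefix up to end of segment k) - V(prefix up to its start).

    `boundaries` lists cut points, starting at 0 and ending at len(tokens),
    so values are estimated only at segment boundaries, not at every token.
    """
    rng = random.Random(seed)
    values = [mc_value(tokens[:b], rollout_reward, n_samples, rng)
              for b in boundaries]
    return [values[k + 1] - values[k] for k in range(len(values) - 1)]


def toy_reward(prefix: Sequence[str], rng: random.Random) -> float:
    """Toy stand-in for a rollout: reward 1.0 once the key step is in the prefix."""
    return 1.0 if "good" in prefix else 0.0


tokens = ["a", "good", "b", "c"]
advantages = segment_advantages(tokens, boundaries=[0, 2, 4],
                                rollout_reward=toy_reward)
# advantages == [1.0, 0.0]: all credit goes to the segment containing "good".
```

Under these assumptions, a trajectory-level method would assign the same signal to all four tokens, while the segment-level estimate attributes the reward to the first segment only. It needs three value estimates here, one per boundary, rather than one per token.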
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics
Methods: ADaptive gradient method with the OPTimal convergence rate · Entropy Regularization · Proximal Policy Optimization
