Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Yiran Guo; Lijie Xu; Jie Liu; Dan Ye; Shuang Qiu

arXiv:2505.23564 · cs.LG · October 22, 2025

TL;DR

This paper introduces Segment Policy Optimization (SPO), a reinforcement learning framework that improves large language model reasoning by using segment-level advantage estimation, balancing granularity and estimation accuracy without requiring a critic model.

Contribution

SPO is a novel RL method that employs segment-level advantage estimation with new strategies for partitioning and advantage calculation, enhancing reasoning accuracy in language models.

Findings

01 SPO outperforms PPO and GRPO on GSM8K, improving accuracy by 6-12 percentage points.

02 SPO significantly reduces Monte Carlo estimation costs for long chain-of-thought reasoning.

03 SPO improves on GRPO by 7-11 percentage points on MATH500.

Abstract

Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than…
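
As a rough illustration of the intermediate granularity the abstract describes, the sketch below computes one advantage per segment rather than one per token or one per trajectory. It assumes that the value at each segment boundary is estimated by Monte Carlo rollouts (in line with the critic-free framing above) and that a segment's advantage is the change in estimated value across that segment. The function names, the `rollout_reward` callback, and the uniform broadcast of a segment's advantage to its tokens are all hypothetical choices made for illustration; they are not taken from the official repository, and the paper's own partitioning and advantage-calculation strategies are not reproduced here.

```python
from typing import Callable, List, Sequence


def mc_value(prefix: str, rollout_reward: Callable[[str], float], n_samples: int) -> float:
    """Estimate V(prefix) as the mean final reward over n_samples rollouts.

    `rollout_reward` is a hypothetical callback that completes the prefix
    with the current policy and returns the final (e.g. 0/1) reward.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return sum(rollout_reward(prefix) for _ in range(n_samples)) / n_samples


def segment_advantages(boundary_values: Sequence[float]) -> List[float]:
    """Advantage of segment k = V(end of segment k) - V(start of segment k).

    `boundary_values` holds K + 1 value estimates for K segments.
    """
    return [nxt - cur for cur, nxt in zip(boundary_values, boundary_values[1:])]


def broadcast_to_tokens(seg_adv: Sequence[float], seg_lengths: Sequence[int]) -> List[float]:
    """Assumption for illustration: every token inherits its segment's advantage."""
    if len(seg_adv) != len(seg_lengths):
        raise ValueError("need one length per segment")
    token_adv: List[float] = []
    for adv, length in zip(seg_adv, seg_lengths):
        token_adv.extend([adv] * length)
    return token_adv


# Toy example: three segments, so four boundary value estimates.
values = [0.5, 0.75, 0.25, 1.0]
adv = segment_advantages(values)
token_adv = broadcast_to_tokens(adv, [2, 1, 3])
```

In this toy setup the segment advantages telescope: their sum equals the final boundary value minus the initial one, so the segment-level signal redistributes the same overall credit that a trajectory-level method would assign as a single number.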

Code & Models

Repositories

aiframeresearch/spo (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics

Methods: ADaptive gradient method with the OPTimal convergence rate · Entropy Regularization · Proximal Policy Optimization