SSPO: Subsentence-level Policy Optimization

Kun Yang; Zikang Chen; Yanmeng Wang; Zhigen Li; Ning Cheng; Shaojun Wang; Jing Xiao

arXiv:2511.04256 · cs.CL · April 13, 2026

TL;DR

SSPO introduces subsentence-level importance ratios in reinforcement learning for large language models, balancing stability and variance reduction, leading to improved reasoning performance across multiple datasets.

Contribution

The paper proposes SSPO, a subsentence-level policy optimization method that mitigates the stability issues of RLVR for LLMs and outperforms GRPO and GSPO.

Findings

01. SSPO outperforms GRPO and GSPO on five datasets.

02. SSPO achieves state-of-the-art results on four datasets.

03. SSPO improves training stability and reasoning accuracy.

Abstract

As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, mitigating variance and reducing the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same…
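
To make the dilution effect described in the abstract concrete, here is a minimal sketch; it is not the paper's implementation. It computes token-level importance ratios (as in GRPO) and a length-normalized response-level ratio (as in GSPO) from per-token log-probabilities, then measures the fraction of ratios that fall outside a clipping range. The function names, the toy log-probabilities, and the threshold `eps=0.2` are assumptions chosen for illustration. The `segment_ratios` helper is likewise hypothetical: the text above does not say how SSPO splits a response or aggregates within a subsentence, so the hand-picked boundaries and geometric-mean aggregation are placeholders only.

```python
import math

def token_ratios(new_logps, old_logps):
    """Token-level ratios: one pi_new/pi_old value per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """Response-level ratio: geometric mean of the token ratios
    (the exponential of the mean log-ratio)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def segment_ratios(new_logps, old_logps, boundaries):
    """ASSUMPTION: one geometric-mean ratio per (start, end) segment.
    The segmentation and aggregation rule are illustrative, not SSPO's."""
    ratios = []
    for start, end in boundaries:
        diffs = [n - o for n, o in zip(new_logps[start:end], old_logps[start:end])]
        ratios.append(math.exp(sum(diffs) / len(diffs)))
    return ratios

def clip_fraction(ratios, eps=0.2):
    """Fraction of ratios outside the interval [1 - eps, 1 + eps]."""
    return sum(1 for r in ratios if abs(r - 1.0) > eps) / len(ratios)

# Toy response of 8 tokens: only token 3 has an extreme ratio of about 3.
old_logps = [-1.0] * 8
new_logps = list(old_logps)
new_logps[3] = old_logps[3] + math.log(3.0)

tok = token_ratios(new_logps, old_logps)
seq = sequence_ratio(new_logps, old_logps)
seg = segment_ratios(new_logps, old_logps, [(0, 4), (4, 8)])  # hypothetical split

print("token-level clip fraction:   ", clip_fraction(tok))    # 1 of 8 tokens clipped
print("response-level ratio:        ", round(seq, 4))         # extreme token is diluted
print("response-level clip fraction:", clip_fraction([seq]))  # nothing clipped
print("segment-level clip fraction: ", clip_fraction(seg))    # 1 of 2 segments clipped
```

In this toy case the single outlier is clipped at the token level, disappears entirely once averaged over the whole response, and remains visible when averaged over a shorter span. This matches the abstract's remark that extreme token-level ratios can be diluted by other tokens in the same response.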
