SSPO: Subsentence-level Policy Optimization
Kun Yang, Zikang Chen, Yanmeng Wang, Zhigen Li, Ning Cheng, Shaojun Wang, Jing Xiao

TL;DR
SSPO introduces subsentence-level importance ratios in reinforcement learning for large language models, balancing stability against variance reduction and improving reasoning performance across multiple datasets.
Contribution
The paper proposes SSPO, a novel subsentence-level policy optimization method that mitigates stability issues in RLVR for LLMs, outperforming existing approaches.
Findings
SSPO outperforms GRPO and GSPO on five datasets.
SSPO achieves state-of-the-art results on four datasets.
SSPO improves training stability and reasoning accuracy.
Abstract
As a key component of large language model (LLM) post-training, Reinforcement Learning from Verifiable Rewards (RLVR) has substantially improved reasoning performance. However, existing RLVR algorithms exhibit distinct stability issues: GRPO (Group Relative Policy Optimization) often suffers from unstable policy updates, while GSPO (Group Sequence Policy Optimization) can retain high-variance tokens. In GRPO, the importance ratio is computed at the token level, which overemphasizes individual tokens and makes learning sensitive to outliers, potentially causing training collapse. GSPO instead computes a response-level importance ratio, mitigating variance and reducing the accumulation of token-level noise present in GRPO. Nevertheless, our experiments show that GSPO frequently yields a near-zero clipping fraction: extreme token-level ratios can be diluted by other tokens in the same…
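The dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation. It assumes the token-level ratio is exp(log p_new - log p_old) per token, and that the response-level ratio is the length-normalized (geometric-mean) ratio over the whole response. The segment-level variant is a guess at what a subsentence-level ratio could look like: the abstract is truncated before it says how SSPO segments a response, so the `boundaries` argument, the clipping range `eps=0.2`, and all log-probabilities are hypothetical.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios (token-level, as in GRPO)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def mean_log_ratio(logp_new, logp_old):
    """Geometric mean of token ratios over a span (length-normalized)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

def segment_ratios(logp_new, logp_old, boundaries):
    """One ratio per segment. `boundaries` is a hypothetical list of
    half-open (start, end) token spans; how SSPO actually segments a
    response is not stated in the source."""
    return [mean_log_ratio(logp_new[s:e], logp_old[s:e]) for s, e in boundaries]

def clip_fraction(ratios, eps=0.2):
    """Fraction of ratios outside [1 - eps, 1 + eps] (eps is an assumed value)."""
    clipped = [r for r in ratios if r < 1 - eps or r > 1 + eps]
    return len(clipped) / len(ratios)

# Toy response of 8 tokens; only token 5 changes under the new policy.
logp_old = [-1.0] * 8
logp_new = [-1.0] * 8
logp_new[5] = -0.2  # one outlier token, ratio about 2.23

tok = token_ratios(logp_new, logp_old)
seq = mean_log_ratio(logp_new, logp_old)                      # about 1.105
seg = segment_ratios(logp_new, logp_old, [(0, 4), (4, 8)])    # about [1.0, 1.221]

print("token-level clip fraction:   ", clip_fraction(tok))
print("response-level clip fraction:", clip_fraction([seq]))
print("segment-level clip fraction: ", clip_fraction(seg))
```

In this toy case the outlier is clipped at the token level, disappears entirely once averaged over the full response, and becomes visible again when the averaging window shrinks to a four-token segment. That is the intermediate granularity the TL;DR attributes to SSPO, under the assumptions stated above.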
