SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks

Tianyi Wang; Yixia Li; Long Li; Yibiao Chen; Shaohan Huang; Yun Chen; Peng Li; Yang Liu; Guanhua Chen

arXiv:2604.08865 · cs.AI · April 13, 2026

TL;DR

SPPO is a scalable reinforcement learning algorithm for long-horizon reasoning in language models. It combines the sample efficiency of PPO with the stability of outcome-based updates, and it outperforms standard PPO on mathematical benchmarks.

Contribution

Introduces Sequence-Level PPO (SPPO), which reformulates the reasoning process as a Sequence-Level Contextual Bandit problem and uses a decoupled scalar value function to derive low-variance advantage signals, improving efficiency and stability.

Findings

01. SPPO outperforms standard PPO on mathematical benchmarks.

02. SPPO matches the performance of computationally intensive group-based methods.

03. SPPO offers a resource-efficient approach for reasoning LLMs.

Abstract

Proximal Policy Optimization (PPO) is central to aligning Large Language Models (LLMs) in reasoning tasks with verifiable rewards. However, standard token-level PPO struggles in this setting due to the instability of temporal credit assignment over long Chain-of-Thought (CoT) horizons and the prohibitive memory cost of the value model. While critic-free alternatives like GRPO mitigate these issues, they incur significant computational overhead by requiring multiple samples for baseline estimation, severely limiting training throughput. In this paper, we introduce Sequence-Level PPO (SPPO), a scalable algorithm that harmonizes the sample efficiency of PPO with the stability of outcome-based updates. SPPO reformulates the reasoning process as a Sequence-Level Contextual Bandit problem, employing a decoupled scalar value function to derive low-variance advantage signals without…
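To make the sequence-level idea concrete, here is a minimal Python sketch under stated assumptions. The abstract above is truncated, so none of these details are taken from the paper: the function names, the per-token clipped ratio, the squared-error value loss, and `clip_eps=0.2` are all illustrative choices. The sketch only shows the general pattern of scoring a whole response with one outcome reward, subtracting one scalar baseline predicted for the prompt, and sharing the resulting advantage across every token.

```python
import math


def sequence_advantage(reward, value_estimate):
    """One advantage per response: outcome reward minus a scalar baseline.

    Assumption: the baseline is a single value predicted for the prompt,
    rather than one value per token as in token-level PPO.
    """
    return reward - value_estimate


def clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate, with the same advantage for every token.

    Assumption: ratios are computed per token and averaged; the paper may
    define the ratio differently.
    """
    total = 0.0
    for new, old in zip(logp_new, logp_old):
        ratio = math.exp(new - old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * advantage, clipped * advantage)
    return total / len(logp_new)


def value_loss(value_estimate, reward):
    """Squared-error regression of the scalar baseline onto the outcome reward.

    Assumption: a plain squared error; this is a hypothetical choice.
    """
    return (value_estimate - reward) ** 2


if __name__ == "__main__":
    # A correct answer (reward 1.0) on a prompt the baseline rates at 0.25.
    adv = sequence_advantage(reward=1.0, value_estimate=0.25)
    old = [-1.2, -0.7, -0.3]
    new = [-1.0, -0.7, -0.4]
    print("advantage:", adv)
    print("objective:", clipped_objective(new, old, adv))
    print("value loss:", value_loss(0.25, 1.0))
```

When the old and new log-probabilities are identical, every ratio is 1 and the objective reduces to the advantage itself. Because only one sampled response per prompt is needed to form the advantage in this sketch, it contrasts with group-based baselines such as GRPO, which the abstract describes as requiring multiple samples for baseline estimation.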

Code & Models

Repositories

sustech-nlp/SPPO (GitHub)
