Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Mingwei Xu, Hao Fang

TL;DR
This paper introduces Positive-Only Policy Optimization (POPO), a reinforcement learning framework that learns solely from positive rollouts and achieves performance competitive with or superior to existing methods such as GRPO on language model benchmarks.
Contribution
POPO is a novel framework for reinforcement learning with verifiable rewards (RLVR) that eliminates the need for negative rollouts, using implicit negative gradients and stabilization techniques to improve policy optimization.
Findings
POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%.
Ablation studies indicate that POPO's components are necessary and robust.
Experiments show that POPO's performance is comparable or superior to GRPO's across mathematical benchmarks.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has, thanks to its deterministic verification, become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community has witnessed a rapid shift from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness of possible negatives makes penalizing a few sampled ones unlikely to provide a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over…
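For readers unfamiliar with the grouped estimate that the abstract attributes to GRPO, the following is a minimal sketch of a group-relative advantage under binary rewards. It illustrates the commonly described GRPO-style normalization (reward minus group mean, divided by group standard deviation), not POPO itself, whose details are cut off above. The function name `group_relative_advantages`, the `eps` constant, and the sample rewards are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts sampled for the same prompt.

    Assumed form: (r_i - mean) / (std + eps), using the population std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 8 rollouts: 1.0 = verified correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Every failed rollout receives the same negative advantage,
# regardless of how or how badly it failed.
negative_advantages = {round(a, 6) for a, r in zip(advantages, rewards) if r == 0.0}

# A group with no positive rollout carries no learning signal at all.
all_failed = group_relative_advantages([0.0] * 8)
```

In this sketch all failed rollouts share one advantage value, which mirrors the abstract's remark that binary rewards admit no gradation of failure severity.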
