Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Mingwei Xu; Hao Fang

arXiv:2605.06650·cs.CL·May 8, 2026

TL;DR

This paper introduces Positive-Only Policy Optimization (POPO), a reinforcement learning framework that learns solely from positive rollouts and performs comparably to or better than existing methods such as GRPO on mathematical reasoning benchmarks for language models.

Contribution

POPO is a novel RLVR framework that eliminates the need for negative rollouts, using implicit negative gradients and stabilization techniques to improve policy optimization.

Findings

01

POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO's 30.00%.

02

Ablation studies show that POPO's components are necessary and robust.

03

Experiments demonstrate that POPO's performance is comparable or superior to GRPO's across mathematical benchmarks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The community has shifted rapidly from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces PPO's complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness of possible failures makes penalizing a few sampled negatives unlikely to provide a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over…
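
The group-relative estimate that the abstract attributes to GRPO can be illustrated with a minimal sketch. This shows only the standard GRPO-style baseline (rewards standardized within one group of rollouts for the same prompt), not POPO itself, whose details are truncated above. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for a single prompt.

    Hypothetical helper: GRPO-style advantage (r - mean) / (std + eps).
    Population std and eps=1e-6 are illustrative choices, not from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, eight rollouts, binary verifier reward (1 = correct, 0 = wrong).
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group with no correct rollout has zero variance, so every advantage is 0.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under binary rewards, every failed rollout in a group receives the same negative advantage regardless of how wrong it was, which is one way to read the abstract's remark that negative rollouts admit no gradation of failure severity.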
