Policy Improvement Reinforcement Learning

Huaiyang Wang; Xiaojie Li; Deqing Wang; Haoyi Zhou; Zixuan Huang; Yaodong Yang; Jianxin Li; Yikun Ban

arXiv:2604.00860·cs.LG·April 29, 2026


TL;DR

This paper introduces PIRL and PIPO, reinforcement learning methods that explicitly measure and optimize policy improvement across iterations, with the aim of improving training stability and performance for language models.

Contribution

The paper proposes a novel reinforcement learning framework that directly optimizes cumulative policy improvement, with a practical implementation for self-correcting policy updates.

Findings

01. PIPO evaluates and reinforces genuine policy improvements at each iteration.

02. PIRL aligns the optimization objective with final task performance.

03. Experiments show improved stability and performance on reasoning benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of…
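The contrast the abstract draws between within-group statistics and inter-iteration feedback can be illustrated with a toy sketch. This is not the paper's PIRL or PIPO algorithm; the function names, the standardized-advantage formula, and the sample rewards are assumptions chosen for illustration only. It shows that group-normalized advantages always average to zero, so on their own they cannot reveal whether the latest update made the policy better or worse, whereas comparing mean verifiable reward across two iterations can.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group (hypothetical helper).

    This stands in for the "instantaneous group-level statistics" the
    abstract mentions; it is an assumed form, not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All responses scored the same: no within-group learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

def policy_improvement(prev_rewards, new_rewards):
    """Change in mean verifiable reward between two consecutive policies
    evaluated on the same prompts (hypothetical inter-iteration signal)."""
    return mean(new_rewards) - mean(prev_rewards)

# Made-up binary verifiable rewards for four responses to the same prompts.
prev_rewards = [1, 0, 0, 1]  # policy before the update
new_rewards = [1, 0, 0, 0]   # policy after the update

advantages = group_relative_advantages(new_rewards)
delta = policy_improvement(prev_rewards, new_rewards)

# The advantages are zero-mean by construction, whatever happened to the
# policy; delta is negative here, which flags a regression.
print("mean advantage:", round(mean(advantages), 9))
print("policy improvement:", delta)
```

In this toy setting the update step would still receive a well-formed advantage signal from the new group, even though the policy got worse; only the inter-iteration comparison exposes the regression.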
