Policy Improvement Reinforcement Learning
Huaiyang Wang, Xiaojie Li, Deqing Wang, Haoyi Zhou, Zixuan Huang, Yaodong Yang, Jianxin Li, Yikun Ban

TL;DR
This paper introduces PIRL and PIPO, reinforcement learning approaches that explicitly measure and optimize policy improvement across iterations, with reported gains in training stability and performance for language models.
Contribution
The paper proposes a novel reinforcement learning framework that directly optimizes cumulative policy improvement, with a practical implementation for self-correcting policy updates.
Findings
PIPO evaluates and reinforces genuine policy improvements at each iteration.
PIRL aligns the optimization objective with final task performance.
Experiments show improved stability and performance on reasoning benchmarks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means optimization can drift or collapse with no mechanism to detect and correct these failures. We argue that the missing ingredient is policy improvement feedback: the ability to measure and optimize inter-iteration progress directly. To this end, we introduce Policy Improvement Reinforcement Learning (PIRL), a framework that replaces surrogate reward maximization with the explicit objective of…
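The contrast the abstract draws between within-group reward signals and inter-iteration progress can be illustrated with a small sketch. This is not the paper's PIRL or PIPO algorithm, whose details are not given here. It assumes a GRPO-style standardization for the within-group signal and a simple difference of mean rewards for the improvement signal. The function names and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Within-group signal (assumed GRPO-style): standardize each reward
    against the mean and standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def policy_improvement(rewards_old, rewards_new):
    """Inter-iteration signal (illustrative assumption): change in mean
    verifiable reward between two successive policy iterates."""
    return mean(rewards_new) - mean(rewards_old)


# Hypothetical binary verifier rewards for one prompt, sampled from the
# policy before and after a single update.
old_rewards = [1, 0, 1, 1]
new_rewards = [1, 0, 0, 0]

adv_old = group_relative_advantages(old_rewards)
adv_new = group_relative_advantages(new_rewards)
delta = policy_improvement(old_rewards, new_rewards)

print("sum of old-group advantages:", round(sum(adv_old), 6))
print("sum of new-group advantages:", round(sum(adv_new), 6))
print("inter-iteration improvement:", delta)
```

In this toy case the standardized advantages are centered on zero within each group, so they look equally well-formed before and after the update, even though mean reward fell from 0.75 to 0.25. Only the explicit comparison across iterations exposes the regression, which is the kind of feedback the abstract argues open-loop methods lack.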
