A Unified Framework for Rethinking Policy Divergence Measures in GRPO
Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang

TL;DR
This paper introduces a unified framework for policy divergence measures in reinforcement learning with verified reward, analyzing their effects on exploration and stability, and proposes the KL3 estimator to improve performance.
Contribution
It develops a general framework for policy divergence, introduces the KL3 estimator, and demonstrates its benefits in stability and performance in LLM reasoning tasks.
Findings
KL3 estimator reduces variance in divergence measurement
Incorporating KL3 improves training stability
Enhanced performance on reasoning benchmarks
Abstract
Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based…
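The abstract describes KL3 as a variance-reduced Monte Carlo estimator of the KL divergence. The sketch below assumes this refers to the widely used "k3" form, (r - 1) - log r with r = p(x)/q(x) and x sampled from q, and compares it with the naive estimator -log r on a small discrete example. The function names, the toy distributions, and the sample count are illustrative assumptions, not taken from the paper.

```python
import math
import random


def kl_estimators(log_ratio):
    """Per-sample estimators of KL(q || p), given log r = log p(x) - log q(x), x ~ q.

    Assumption: "KL3" is taken to be the common k3 form (r - 1) - log r.
    """
    r = math.exp(log_ratio)
    k1 = -log_ratio              # naive estimator: unbiased, can be negative
    k3 = (r - 1.0) - log_ratio   # adds the zero-mean term (r - 1); never negative
    return k1, k3


def exact_kl(q, p):
    """Exact KL(q || p) for two discrete distributions with full support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def compare_estimators(q, p, n_samples=20000, seed=0):
    """Sample x ~ q and return the empirical mean and variance of k1 and k3."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(q)), weights=q, k=n_samples)
    k1_values, k3_values = [], []
    for x in samples:
        k1, k3 = kl_estimators(math.log(p[x]) - math.log(q[x]))
        k1_values.append(k1)
        k3_values.append(k3)
    return {
        "k1_mean": _mean(k1_values),
        "k1_var": _variance(k1_values),
        "k3_mean": _mean(k3_values),
        "k3_var": _variance(k3_values),
        "k3_min": min(k3_values),
    }


# Hypothetical toy distributions standing in for two nearby policies.
q = [0.5, 0.3, 0.2]
p = [0.45, 0.35, 0.2]
stats = compare_estimators(q, p)
true_kl = exact_kl(q, p)
print(f"exact KL = {true_kl:.5f}")
print(f"k1: mean = {stats['k1_mean']:.5f}, var = {stats['k1_var']:.6f}")
print(f"k3: mean = {stats['k3_mean']:.5f}, var = {stats['k3_var']:.6f}")
```

Under these assumptions both estimators target the same quantity, but the k3 samples are non-negative and, when the two distributions are close, have a much smaller spread than the k1 samples. How the paper turns this estimator into a clipping constraint is not specified in the truncated abstract, so the sketch stops at the estimator itself.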
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics
