A Unified Framework for Rethinking Policy Divergence Measures in GRPO

Qingyuan Wu; Yuhui Wang; Simon Sinong Zhan; Yanning Dai; Shilong Deng; Sarra Habchi; Qi Zhu; Matthias Gallé; Chao Huang

arXiv:2602.05494 · cs.LG · February 10, 2026

Open Access

TL;DR

This paper introduces a unified framework for policy divergence measures in Reinforcement Learning with Verified Reward (RLVR), analyzes how those measures affect exploration and stability, and identifies the KL3 estimator as a policy divergence constraint that improves performance.

Contribution

The paper develops a general framework for policy divergence, identifies the KL3 estimator as a key policy divergence constraint, and demonstrates its benefits for stability and performance on LLM reasoning tasks.

Findings

01

KL3 estimator reduces variance in divergence measurement

02

Incorporating KL3 improves training stability

03

Enhanced performance on reasoning benchmarks

Abstract

Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based…
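The abstract describes KL3 as a variance-reduced Monte Carlo estimator of the KL divergence but does not give its formula. The sketch below assumes KL3 refers to the k3-style estimator (r - 1) - log r, where r = p(x)/q(x) and x is sampled from q, which is commonly used as a KL term in GRPO-style implementations. That reading is an assumption, not something the listing confirms. All function names and the toy distributions are hypothetical and chosen only for illustration. Expectations are computed exactly over a small discrete distribution, so no sampling is involved.

```python
import math


def k1(log_ratio):
    """Naive single-sample estimator of KL(q || p): -log r, r = p(x)/q(x), x ~ q."""
    return -log_ratio


def k3(log_ratio):
    """Assumed KL3 form: (r - 1) - log r.

    Non-negative for every sample because log r <= r - 1, and it has the
    same mean as k1 because E_q[r - 1] = 0.
    """
    return math.expm1(log_ratio) - log_ratio


def exact_kl(q, p):
    """Exact KL(q || p) for two discrete distributions with full support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))


def mean_under_q(estimator, q, p):
    """Exact expectation of a per-sample estimator when x ~ q."""
    return sum(qi * estimator(math.log(pi / qi)) for qi, pi in zip(q, p))


def variance_under_q(estimator, q, p):
    """Exact variance of a per-sample estimator when x ~ q."""
    mean = mean_under_q(estimator, q, p)
    return sum(
        qi * (estimator(math.log(pi / qi)) - mean) ** 2 for qi, pi in zip(q, p)
    )


# Toy distributions (hypothetical): q plays the sampling policy, p the other policy.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]

kl_true = exact_kl(q, p)
k1_mean, k3_mean = mean_under_q(k1, q, p), mean_under_q(k3, q, p)
k1_var, k3_var = variance_under_q(k1, q, p), variance_under_q(k3, q, p)

print(f"exact KL: {kl_true:.6f}")
print(f"k1 mean: {k1_mean:.6f}  variance: {k1_var:.6f}")
print(f"k3 mean: {k3_mean:.6f}  variance: {k3_var:.6f}")
```

On this toy pair, both estimators match the exact KL in expectation, while the k3 form is non-negative per sample and has lower variance than k1. Whether that variance gap holds in general depends on how far apart the two policies are.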

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics