On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang; Yifeng Liu; Huizhuo Yuan; Yang Yuan; Quanquan Gu; Andrew Chi-Chih Yao

arXiv:2505.17508·cs.LG·February 20, 2026


TL;DR

This paper introduces Regularized Policy Gradient (RPG), a unified framework for designing KL-regularized policy gradient algorithms for LLM reasoning. It supports stable off-policy training and improves accuracy on reasoning benchmarks.

Contribution

It provides a unified derivation of KL variants, corrects importance-weighting mismatches, and introduces RPG-Style Clip for stable off-policy training of LLMs.
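To illustrate the kind of importance-weighting mismatch mentioned here, the following is a minimal sketch on small discrete distributions, not the paper's derivation. It uses only the standard importance-sampling identity: when samples come from a behaviour policy rather than the current policy, an expectation under the current policy needs the ratio of the two as a weight. The names `reverse_kl`, `off_policy_kl`, `pi`, `ref`, and `behaviour`, and all numeric values, are hypothetical.

```python
import math

def reverse_kl(pi, ref):
    """Exact KL(pi || ref) for two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(pi, ref))

def off_policy_kl(pi, ref, behaviour, weighted):
    """Expectation of log(pi/ref) under samples drawn from `behaviour`.

    With weighted=True each term carries the importance weight pi/behaviour,
    which recovers the expectation under pi. With weighted=False the weight
    is dropped, so the result is an expectation under the wrong distribution.
    """
    total = 0.0
    for a, b, mu in zip(pi, ref, behaviour):
        w = a / mu if weighted else 1.0
        total += mu * w * math.log(a / b)
    return total

# Hypothetical toy distributions over three outcomes.
pi = [0.6, 0.3, 0.1]         # current policy
ref = [0.4, 0.4, 0.2]        # reference policy
behaviour = [0.3, 0.3, 0.4]  # older policy that generated the samples

print("exact KL(pi || ref):", reverse_kl(pi, ref))
print("weighted estimate:  ", off_policy_kl(pi, ref, behaviour, weighted=True))
print("unweighted estimate:", off_policy_kl(pi, ref, behaviour, weighted=False))
```

In this toy case the weighted value matches the exact KL, while the unweighted value does not; the expectations are computed exactly rather than by sampling so the comparison is free of Monte Carlo noise.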

Findings

01. RPG-REINFORCE with RPG-Style Clip improves accuracy by up to 6 percentage points on reasoning benchmarks.

02. Achieves 52% accuracy on AIME25 with 8K context length, surpassing some existing models.

03. Demonstrates stability and scalability of the proposed RL algorithm for LLM reasoning.

Abstract

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface (the choice of KL direction, forward vs. reverse; normalization, normalized vs. unnormalized; and estimator, $k_{1}$ / $k_{2}$ / $k_{3}$) is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_{3}$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with…
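The relationship between the $k_{3}$ penalty and the unnormalized KL can be checked numerically. The sketch below assumes the commonly used per-sample definitions with ratio $r = p(x)/q(x)$ and samples drawn from $q$: $k_{1} = -\log r$, $k_{2} = \tfrac{1}{2}(\log r)^{2}$, $k_{3} = (r - 1) - \log r$. Whether the paper uses exactly this convention is an assumption, and the helper names and toy distributions are hypothetical. Expectations are computed exactly over a three-outcome space.

```python
import math

# Per-sample estimators as functions of the ratio r = p(x) / q(x), x ~ q.
def k1(r):
    return -math.log(r)

def k2(r):
    return 0.5 * math.log(r) ** 2

def k3(r):
    return (r - 1.0) - math.log(r)

def expect_under_q(q, p, estimator):
    """Exact expectation of estimator(p/q) when x is drawn from q."""
    return sum(qi * estimator(pi / qi) for qi, pi in zip(q, p))

def kl(q, p):
    """Sum of q * log(q/p); the usual KL(q || p) when both are normalized."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

def unnormalized_kl(q, p):
    """Generalized KL for non-negative measures: adds the mass difference."""
    return kl(q, p) + sum(p) - sum(q)

# Hypothetical toy values.
q = [0.5, 0.3, 0.2]         # sampling distribution
p = [0.4, 0.4, 0.2]         # normalized target
p_unnorm = [0.6, 0.2, 0.4]  # target with total mass 1.2

print("KL(q || p):      ", kl(q, p))
print("E_q[k1], E_q[k3]:", expect_under_q(q, p, k1), expect_under_q(q, p, k3))
print("E_q[k2]:         ", expect_under_q(q, p, k2))
print("unnormalized KL: ", unnormalized_kl(q, p_unnorm))
print("E_q[k3], unnorm: ", expect_under_q(q, p_unnorm, k3))
```

When `p` is normalized, the expectations of `k1` and `k3` both equal the KL, while `k2` is a biased approximation. When `p` carries extra mass, the expectation of `k3` equals the unnormalized KL, because the `r - 1` term contributes exactly the mass difference `sum(p) - sum(q)`.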

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

* Nice theoretical analysis between k3 estimator and UKL
* The tackled KL divergence issue is indeed a pitfall of GRPO

Weaknesses

The experiments in the paper are quite poorly conducted:
* The model is tested only on math tasks, and even then, only two AIME benchmarks are used.
* Figure 2 is presented without any analysis. It seems that in Figure 2(c), DAPO is not converging; what would it look like if we extrapolated the training curve?
* (c) RL training often exhibits high variance; instead of showing a single curve, the authors should include a curve with a standard deviation region.
* (d) In Figure 2(d), why is only G

Reviewer 02·Rating 4·Confidence 3

Strengths

1. The paper provides a unified framework connecting various KL regularization forms. The derivations offer clarity and formal grounding.
2. The authors propose RPG-Style Clip, which can effectively balance stability and bias under off-policy training and improve scalability for LLMs.
3. The experiments demonstrate notable improvements in both accuracy and stability.

Weaknesses

1. Some representative baselines are missing, e.g., RLOO, REINFORCE++, GPG.
2. The experiments lack ablation studies isolating the contributions of RPG-Style Clip, reference-update frequency, and clipping thresholds.
3. The RPG-Style Clip introduces a bias-variance trade-off, but it is not explored quantitatively.
4. Sections 3 and 4 are heavy and hard to follow.

Reviewer 03·Rating 6·Confidence 3

Strengths

- It provides a compact derivation that covers Forward/Reverse × Normalized/Unnormalized KL, with explicit off-policy corrections and both differentiable and REINFORCE-style surrogates.
- It pinpoints the missing importance weight in GRPO’s KL when sampling off-policy, which could be an issue that many people overlook.
- The paper introduces RPG-Style Clip (dual-clip/truncation) and iterative reference updates, which empirically stabilize off-policy PG for LLM reasoning while requiring only a si

Weaknesses

- All core results are on Qwen3-4B and math reasoning. Claims about “stable and scalable RL” would be stronger with larger backbones (e.g., 7B–14B) and diverse tasks or datasets.
- While GRPO/DAPO are relevant, head-to-head comparisons with varying KL directions (FKL/RKL vs UFKL/URKL) would solidify the empirical story. Current ablations on clip hyper-parameters and reference-update schedules are limited in the main text.
- RPG-Style Clip introduces controlled bias via truncation. The paper ackn

Code & Models

Datasets

math-dataset/DAPO-17k-Eng
dataset· 99 dl

Taxonomy

Topics: Advanced Control Systems Optimization