A Comedy of Estimators: On KL Regularization in RL Training of LLMs

Vedant Shah; Johan Obando-Ceron; Vineet Jain; Brian Bartoldson; Bhavya Kailkhura; Sarthak Mittal; Glen Berseth; Pablo Samuel Castro; Yoshua Bengio; Nikolay Malkin; Moksh Jain; Siddarth Venkatraman; Aaron Courville

arXiv:2512.21852·cs.LG·March 19, 2026

Open Access · 3 Reviews

TL;DR

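This paper systematically analyzes how different KL divergence estimators, and the ways they are incorporated into the objective, affect the training stability and performance of RL-finetuned large language models. It finds that configurations with unbiased gradients give more stable training and better in- and out-of-domain performance.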

Contribution

The paper provides a detailed analysis of estimator configurations and their gradient biases, validates the analysis empirically on multiple LLMs, and highlights the importance of KL estimator configurations whose gradients are unbiased in RL training.

Findings

01

Biased gradient estimators can cause training instability.

02

Unbiased estimators lead to better in- and out-of-domain performance.

03

KL regularization stabilizes off-policy RL training.

Abstract

The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation.…
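The abstract excerpt does not define the estimators it refers to. As a minimal sketch, assuming the paper's K1 and K3 follow the commonly used naming (K1 = -log r and K3 = (r - 1) - log r, with r = π_ref(y)/π_θ(y) and y sampled from the policy π_θ), the following compares both Monte Carlo estimates against the exact reverse KL on a toy categorical distribution. The function name `kl_estimators` and the toy probabilities are illustrative and not taken from the paper.

```python
import numpy as np

def kl_estimators(logp_policy, logp_ref):
    """Per-sample K1 and K3 estimates of KL(policy || ref).

    Both inputs are log-probabilities of samples drawn from the policy.
    The K1/K3 definitions used here are an assumption (common naming).
    """
    log_ratio = logp_ref - logp_policy    # log r, with r = ref / policy
    k1 = -log_ratio                       # can be negative for a single sample
    k3 = np.expm1(log_ratio) - log_ratio  # (r - 1) - log r, never negative
    return k1, k3

# Toy distributions over four outcomes (illustrative values).
policy = np.array([0.50, 0.30, 0.15, 0.05])
ref = np.array([0.25, 0.25, 0.25, 0.25])

# Exact reverse KL, tractable here only because the support is tiny.
exact_kl = float(np.sum(policy * (np.log(policy) - np.log(ref))))

# On-policy samples, as in the setting the abstract describes.
rng = np.random.default_rng(0)
samples = rng.choice(len(policy), size=200_000, p=policy)
k1, k3 = kl_estimators(np.log(policy[samples]), np.log(ref[samples]))

print(f"exact={exact_kl:.4f}  K1 mean={k1.mean():.4f}  K3 mean={k3.mean():.4f}")
```

Both sample means approach the exact value, so under these assumptions either one is an unbiased estimate of the divergence itself. Whether its gradient is also unbiased is a separate question, and that is the one the paper studies.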

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- The pathwise vs. score-function breakdown (Eq. (6)) is clean and correct as presented. It identifies which term each implementation estimates.
- Table 1 maps estimator × placement to bias and claimed behavior. This is useful for practitioners.
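Eq. (6) of the paper is not reproduced on this page. A standard form of the decomposition the reviewer refers to is sketched below. The notation is assumed (π_θ for the trained policy, π_ref for the reference policy), and the labeling of the two terms follows the usual convention, which may differ from the paper's exact presentation.

```latex
\begin{align*}
\nabla_\theta \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})
  &= \nabla_\theta \sum_y \pi_\theta(y)\,
     \log\frac{\pi_\theta(y)}{\pi_{\mathrm{ref}}(y)} \\
  &= \underbrace{\mathbb{E}_{y\sim\pi_\theta}\!\left[
       \log\frac{\pi_\theta(y)}{\pi_{\mathrm{ref}}(y)}\,
       \nabla_\theta \log\pi_\theta(y)\right]}_{\text{score-function term}}
   + \underbrace{\mathbb{E}_{y\sim\pi_\theta}\!\left[
       \nabla_\theta \log\pi_\theta(y)\right]}_{\text{pathwise term}} \\
\mathbb{E}_{y\sim\pi_\theta}\!\left[\nabla_\theta \log\pi_\theta(y)\right]
  &= \sum_y \nabla_\theta \pi_\theta(y) = \nabla_\theta 1 = 0 .
\end{align*}
```

Under this reading, differentiating a per-sample log-ratio with the sample held fixed yields only the pathwise term, which vanishes in expectation. The full reverse-KL gradient is carried entirely by the score-function term.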

Weaknesses

1. The main theoretical point, that many common KL implementations do not yield the reverse-KL gradient, is already made by Tang & Munos (2025), and the paper admits this. Much of the math repeats known results and blog-level derivations, including: K1 is unbiased in reward; the pathwise K1 gradient is zero in expectation; K3 is an unbiased *divergence* estimator but gives a biased gradient in both placements. The paper’s main claimed novelty is the table systematization and a small empirical

Reviewer 02 · Rating 6 · Confidence 1

Strengths

1. **Systematic analysis and unified perspective.** The paper provides a clear and thorough theoretical analysis of how different KL estimators behave when applied within RL training for LLMs. It systematically distinguishes the gradient properties of various configurations (e.g., K1-in-reward, K3-in-loss), offering a principled understanding of why some widely used implementations are biased or unstable. The paper helps unify several inconsistent practices used across current RLHF frameworks. T

Weaknesses

1. **Lack of practical contribution.** The paper’s findings mainly reinforce practices that are already widely adopted, explicitly or implicitly, in existing RLHF implementations. As a result, the contribution feels more clarificatory than innovative: it formalizes established patterns rather than proposing new directions.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

+ This paper studies an important problem.
+ It provides a detailed comparison between different KL estimators, and the empirical results are comprehensive.
+ Several takeaways are offered that may guide a more principled use of KL divergence.

Weaknesses

+ The notation is a bit confusing. I had to go back and forth a couple of times to figure out what K1 and K3 actually stand for. The authors may consider naming them in a more informative way.
+ The main observations are made empirically. It would be more insightful if further theoretical understanding were provided.
+ It remains unclear how different KL estimators interact with ratio clipping. That is, it does not fully isolate the effect of the KL estimator given the existence of the ratio clip

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications