EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL

Lunjun Zhang; Jimmy Ba

arXiv:2602.04417·cs.LG·February 5, 2026

Open Access

TL;DR

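This paper introduces two techniques for policy gradient reinforcement learning in large language models: an EMA anchor, which replaces the fixed anchor policy with an exponential moving average, and a Top-k KL estimator, which interpolates between exact and sampled KL. Combined with GRPO, they raise R1-distilled Qwen-1.5B from 50.8% to 53.9% on OlympiadBench.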

Contribution

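It proposes two techniques for policy gradient algorithms on LLMs: an EMA anchor policy, analogous to a target network in deep Q-learning, with derived stability conditions; and a Top-k KL estimator that yields unbiased KL values and unbiased gradients at any k.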

Findings

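01. On math reasoning, the two techniques combined with GRPO (EMA-PG) bring R1-distilled Qwen-1.5B to 53.9% on OlympiadBench, compared to 50.8% for GRPO.

02. Improved results are also reported in the agentic RL domain across multiple datasets.

03. Stability conditions for the EMA anchor are derived, and the Top-k KL estimator is shown to be unbiased in both value and gradient at any k.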

Abstract

Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO.…
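The abstract names the two techniques without giving formulas, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes the anchor is updated as a per-parameter exponential moving average, that the KL is taken from the policy `p` to the anchor `q`, and that the Top-k estimator sums the exact KL terms of the k most likely tokens under `p` and adds a single-sample term for the remaining tail. The names `ema_update`, `exact_kl`, `topk_kl_estimate`, and the `decay` value are hypothetical. The sketch covers only the unbiasedness of the KL value; it does not address how gradients are handled.

```python
import math


def ema_update(anchor, policy, decay=0.99):
    """Move each anchor parameter a fraction (1 - decay) toward the policy.

    Assumed update rule: anchor <- decay * anchor + (1 - decay) * policy.
    """
    return [decay * a + (1.0 - decay) * p for a, p in zip(anchor, policy)]


def exact_kl(p, q):
    """Exact KL(p || q), summed over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def topk_kl_estimate(p, q, sampled_token, k):
    """Estimate KL(p || q) from one token sampled from p.

    The k most likely tokens under p contribute their exact KL terms.
    A sampled token outside that set contributes log(p/q) as a
    single-sample estimate of the tail (hypothetical construction).
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top = set(order[:k])
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0.0)
    if sampled_token in top:
        return head  # already counted exactly in the head
    return head + math.log(p[sampled_token] / q[sampled_token])


if __name__ == "__main__":
    # Toy next-token distributions over a five-token vocabulary.
    p = [0.5, 0.2, 0.15, 0.1, 0.05]  # current policy
    q = [0.4, 0.3, 0.1, 0.1, 0.1]    # anchor policy
    target = exact_kl(p, q)
    for k in range(len(p) + 1):
        # Average the estimator over every possible sample, weighted by p.
        mean = sum(p[x] * topk_kl_estimate(p, q, x, k) for x in range(len(p)))
        print(k, abs(mean - target) < 1e-12)
```

Under these assumptions, k = 0 reduces to the plain sampled estimate log(p/q), k equal to the vocabulary size reproduces the exact KL regardless of the sample, and every k in between has the same expectation, which is consistent with the interpolation described in the abstract.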

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Natural Language Processing Techniques