EMA Policy Gradient: Taming Reinforcement Learning for LLMs with EMA Anchor and Top-k KL
Lunjun Zhang, Jimmy Ba

TL;DR
This paper introduces two techniques for policy gradient reinforcement learning with large language models: an EMA anchor and a Top-k KL estimator. Combined with GRPO, they raise R1-distilled Qwen-1.5B from 50.8% to 53.9% on OlympiadBench.
Contribution
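The paper proposes two techniques for policy gradient algorithms on LLMs. The EMA anchor replaces the fixed anchor policy with an exponential moving average of the policy, similar to a target network in deep Q-learning. The Top-k KL estimator interpolates between exact KL and sampled KL.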
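To illustrate the EMA anchor idea, here is a minimal sketch of an exponential-moving-average parameter update, in the style of a target network. The function name `ema_update`, the `decay` value, and the dictionary-of-lists parameter layout are assumptions made for illustration; the paper's actual update schedule and hyperparameters are not given in this summary.

```python
def ema_update(anchor, policy, decay=0.99):
    """Return new anchor parameters: decay * anchor + (1 - decay) * policy.

    `anchor` and `policy` are hypothetical dicts mapping a parameter name
    to a flat list of floats. `decay` is an assumed hyperparameter.
    """
    return {
        name: [decay * a + (1.0 - decay) * w
               for a, w in zip(anchor[name], policy[name])]
        for name in anchor
    }


# Toy example: the anchor moves a small step toward the current policy.
policy_params = {"w": [1.0, 2.0]}
anchor_params = {"w": [0.0, 0.0]}
anchor_params = ema_update(anchor_params, policy_params, decay=0.9)
```

With `decay` close to 1 the anchor trails the policy slowly, which is closer to a fixed anchor; with `decay` at 0 the anchor simply copies the current policy.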
Findings
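On math reasoning, EMA-PG (both techniques combined with GRPO) brings R1-distilled Qwen-1.5B to 53.9% on OlympiadBench, compared with 50.8% for GRPO.
Improved results in agentic RL domains across multiple datasets.
Stability conditions are derived for the EMA anchor, and the Top-k KL estimator is shown to yield unbiased KL values and unbiased gradients at any k.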
Abstract
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to acquire increasingly complex reasoning and agentic behaviors. In this work, we propose two simple techniques to improve policy gradient algorithms for LLMs. First, we replace the fixed anchor policy during RL with an Exponential Moving Average (EMA), similar to a target network in deep Q-learning. Second, we introduce the Top-k KL estimator, which allows for flexible interpolation between exact KL and sampled KL. We derive the stability conditions for using the EMA anchor; moreover, we show that our Top-k KL estimator yields both unbiased KL values and unbiased gradients at any k, while bringing the benefits of exact KL. When combined with GRPO, the two techniques (EMA-PG) lead to a significant performance boost. On math reasoning, it allows R1-distilled Qwen-1.5B to reach 53.9% on OlympiadBench compared to 50.8% by GRPO.…
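One way to picture an interpolation between exact KL and sampled KL is sketched below. This is an illustrative construction, not necessarily the paper's estimator: it sums the KL terms exactly over the k most probable tokens under the policy and covers the remaining tail with the log-ratio of a single token sampled from the policy. The names `exact_kl` and `topk_kl_estimate` are hypothetical, and the sketch addresses only the unbiasedness of the KL value, not of its gradient.

```python
import math


def exact_kl(p, q):
    """Exact KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def topk_kl_estimate(p, q, k, sample):
    """Assumed sketch: exact head over the top-k tokens of p, sampled tail.

    `sample` is the index of a token drawn from p. If it falls inside the
    top-k set, its contribution is already counted exactly in the head.
    """
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    top = set(order[:k])
    head = sum(p[i] * math.log(p[i] / q[i]) for i in top)
    tail = 0.0 if sample in top else math.log(p[sample] / q[sample])
    return head + tail


# Toy distributions over a 4-token vocabulary.
p = [0.5, 0.3, 0.15, 0.05]
q = [0.25, 0.25, 0.25, 0.25]

# Averaging the estimate over sample ~ p recovers the exact KL for every k.
for k in range(len(p) + 1):
    mean = sum(p[x] * topk_kl_estimate(p, q, k, x) for x in range(len(p)))
    assert abs(mean - exact_kl(p, q)) < 1e-9
```

In this sketch, k equal to the vocabulary size gives the exact KL regardless of the sample, and k = 0 reduces to the plain single-sample log-ratio estimate.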
Taxonomy
Topics: Topic Modeling · Reinforcement Learning in Robotics · Natural Language Processing Techniques
