Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Chujie Zheng; Kai Dang; Bowen Yu; Mingze Li; Huiqiang Jiang; Junrong Lin; Yuqiong Liu; Hao Lin; Chencan Wu; Feng Hu; An Yang; Jingren Zhou; Junyang Lin

arXiv:2512.01374 · cs.LG · December 4, 2025

Open Access

TL;DR

This paper introduces a formulation of reinforcement learning with large language models that explains when a token-level surrogate objective validly optimizes the sequence-level reward, and it provides practical stabilization techniques validated through extensive experiments.

Contribution

The paper offers a theoretical framework for understanding RL stability with LLMs and develops practical training recipes validated on a 30B MoE model.

Findings

01. Importance sampling correction improves training stability.

02. Clipping and Routing Replay are crucial for off-policy stability.

03. Stable training leads to consistent final performance regardless of initialization.
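
To make the clipping finding concrete, here is a minimal sketch of a generic PPO-style clipped per-token objective. It is an illustration only: the function name `clipped_token_objective`, the `eps` default of 0.2, and the exact clipping form are assumptions, and the paper's own clipping variant may differ. Routing Replay is not modeled here.

```python
def clipped_token_objective(ratio, advantage, eps=0.2):
    """Generic PPO-style clipped objective for a single token.

    ratio:     pi_theta(token) / pi_old(token), a measure of policy staleness
    advantage: advantage estimate assigned to the token
    eps:       clipping range (hypothetical default, not taken from the paper)
    """
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum removes any incentive to push the ratio
    # further outside the clipping range.
    return min(ratio * advantage, clipped * advantage)


# A stale token with a large ratio contributes only the clipped value.
stale_positive = clipped_token_objective(ratio=1.5, advantage=1.0)
# A token with ratio 1 (fully on-policy) is unaffected by clipping.
on_policy = clipped_token_objective(ratio=1.0, advantage=2.0)
```

In this sketch, once the ratio leaves the clipping range in the direction that would increase the objective, the value becomes constant in the ratio. The token then contributes no gradient, which is one common way to limit the effect of stale samples.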

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with…
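
The first-order argument in the abstract can be illustrated numerically. The sketch below rests on an assumption not spelled out in the text above: that the token-level surrogate comes from linearizing the sequence-level importance ratio (a product of per-token ratios) around 1. It further assumes each per-token ratio factors into a training-inference discrepancy term and a policy staleness term. All function names and numeric values are hypothetical.

```python
import math


def sequence_ratio(token_ratios):
    """Exact sequence-level importance ratio: product of per-token ratios."""
    return math.prod(token_ratios)


def first_order_ratio(token_ratios):
    """First-order expansion of the product around 1: 1 + sum_t (r_t - 1)."""
    return 1.0 + sum(r - 1.0 for r in token_ratios)


def token_ratio(discrepancy, staleness):
    """Assumed factorization of the per-token ratio:
    (pi_old / mu_old) * (pi_theta / pi_old), i.e. discrepancy * staleness."""
    return discrepancy * staleness


def approximation_error(discrepancy, staleness, length=20):
    """Gap between the exact ratio and its linearization for a sequence
    of `length` tokens that all share the same per-token ratio."""
    ratios = [token_ratio(discrepancy, staleness)] * length
    return abs(sequence_ratio(ratios) - first_order_ratio(ratios))


# Both factors close to 1: the linearization tracks the exact ratio closely.
small_gap = approximation_error(discrepancy=1.001, staleness=1.001)
# Both factors further from 1: the linearization error grows sharply.
large_gap = approximation_error(discrepancy=1.05, staleness=1.05)

print(f"small discrepancy and staleness: error = {small_gap:.6f}")
print(f"large discrepancy and staleness: error = {large_gap:.6f}")
```

Under these assumptions the error stays small only while both factors remain near 1. This is consistent with the abstract's claim that the surrogate is valid when both the training-inference discrepancy and policy staleness are minimized.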

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification