Stable Reinforcement Learning for Efficient Reasoning

Muzhi Dai; Shixuan Liu; Qingyi Si

arXiv:2505.18086 · cs.AI · May 26, 2025

TL;DR

This paper introduces GRPO-λ, a stabilized reinforcement learning method for large language models that balances reasoning accuracy and efficiency by dynamically adjusting reward strategies based on correctness ratios during training.

Contribution

It proposes a novel dynamic reward adjustment technique, GRPO-λ, to stabilize RL training and improve reasoning performance while reducing reasoning chain length.

Findings

01 Improves average accuracy by 1.48% on multiple benchmarks.

02 Reduces chain-of-thought sequence length by 47.3%.

03 Avoids training instability caused by length penalties.

Abstract

The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy abruptly collapses, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-λ, an efficient and stabilized variant of GRPO that dynamically adjusts the reward strategy by monitoring the correctness ratio among completions…
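The idea of switching reward strategy on the correctness ratio can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract is truncated before the details, so the threshold `ratio_threshold`, the penalty weight `alpha`, the linear form of the length penalty, and the function names are all hypothetical choices made for illustration. The only parts taken from the text are the 0/1 outcome reward, the length penalty on correct completions, and the switch driven by the fraction of correct completions in a sampled group. The advantage step uses the usual GRPO group normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def group_rewards(correct, lengths, ratio_threshold=0.5, alpha=0.2):
    """Assign one reward per completion in a group sampled for one prompt.

    correct: list of bools (did the completion reach the right answer?)
    lengths: list of completion lengths in tokens
    ratio_threshold, alpha: hypothetical values, not taken from the paper
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and aligned")

    # Plain 0/1 outcome reward, used as the length-agnostic fallback.
    outcome = [1.0 if c else 0.0 for c in correct]

    ratio = sum(correct) / len(correct)
    correct_lengths = [n for c, n in zip(correct, lengths) if c]

    # Low correctness ratio: skip the length penalty so that accuracy
    # is the only training signal for this group.
    if ratio < ratio_threshold or not correct_lengths:
        return outcome

    # High correctness ratio: penalize longer correct completions.
    # Assumed form: linear in the length rank between shortest and longest.
    shortest, longest = min(correct_lengths), max(correct_lengths)
    span = longest - shortest
    rewards = []
    for is_correct, n in zip(correct, lengths):
        if not is_correct:
            rewards.append(0.0)
        else:
            relative = (n - shortest) / span if span else 0.0
            rewards.append(1.0 - alpha * relative)
    return rewards


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages (population std is a choice)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    hard = group_rewards([True, False, False, False], [100, 200, 300, 400])
    easy = group_rewards([True, True, True, False], [100, 200, 300, 400])
    print(hard)
    print(easy)
    print(group_advantages(easy))
```

In this sketch a group with few correct completions falls back to the plain 0/1 reward, while a mostly correct group rewards its shortest correct completion most. Keeping every correct reward above zero (here by requiring `alpha < 1`) means a correct answer never scores worse than an incorrect one.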
