Stable Reinforcement Learning for Efficient Reasoning
Muzhi Dai, Shixuan Liu, Qingyi Si

TL;DR
This paper introduces GRPO-λ, a stabilized reinforcement learning method for large language models that balances reasoning accuracy and efficiency by dynamically adjusting reward strategies based on correctness ratios during training.
Contribution
The paper proposes GRPO-λ, a dynamic reward adjustment technique that stabilizes RL training and improves reasoning accuracy while shortening reasoning chains.
Findings
Improves average accuracy by 1.48% on multiple benchmarks.
Reduces chain-of-thought sequence length by 47.3%.
Avoids training instability caused by length penalties.
Abstract
The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods like GRPO. However, such rule-based 0/1 outcome reward methods cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy abruptly collapses, often early in training. To address this issue, we propose a simple yet effective solution, GRPO-λ, an efficient and stabilized variant of GRPO, which dynamically adjusts the reward strategy by monitoring the correctness ratio among completions…
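The abstract is truncated before it specifies how the reward strategy is adjusted, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a hypothetical rule: when the fraction of correct completions in a group falls below a threshold, fall back to the plain 0/1 outcome reward; otherwise apply a length penalty to correct completions. The names `dynamic_rewards`, `threshold`, and `alpha`, and the linear penalty itself, are illustrative assumptions. The group-relative advantage follows the usual GRPO-style standardization within a group (population standard deviation is used here, which is also an assumption).

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one group of completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_rewards(correct, lengths, threshold=0.8, alpha=0.5):
    """Hypothetical reward switch driven by the group's correctness ratio.

    correct:   list of bools, one per completion in the group
    lengths:   list of completion lengths (e.g. token counts)
    threshold: assumed ratio above which the length penalty is enabled
    alpha:     assumed strength of the linear length penalty
    Returns (rewards, correctness_ratio).
    """
    ratio = sum(correct) / len(correct)
    if ratio < threshold:
        # Accuracy is low: use the plain 0/1 outcome reward, no length pressure.
        return [1.0 if c else 0.0 for c in correct], ratio
    # Accuracy is high: correct completions earn more when they are shorter.
    max_len = max(lengths)
    rewards = [
        (1.0 - alpha * n / max_len) if c else 0.0
        for c, n in zip(correct, lengths)
    ]
    return rewards, ratio


if __name__ == "__main__":
    rewards, ratio = dynamic_rewards([True, True, True, True], [100, 200, 300, 400])
    print(ratio, rewards, group_advantages(rewards))
```

In this sketch a mostly incorrect group is never pushed toward shorter outputs, which is one plausible way to avoid the accuracy collapse the abstract describes; the actual criterion and penalty used by GRPO-λ should be taken from the paper itself.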
