Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training

Mingjie Liu; Shizhe Diao; Jian Hu; Ximing Lu; Xin Dong; Hao Zhang; Alexander Bukharin; Shaokun Zhang; Jiaqi Zeng; Makesh Narsimhan Sreedhar; Gerald Shen; David Mosallanezhad; Di Zhang; Jonas Yang; June Yang; Oleksii Kuchaiev; Guilin Liu; Zhiding Yu; Pavlo Molchanov; Yejin Choi; Jan Kautz; Yi Dong

arXiv:2507.12507 · cs.LG · July 18, 2025

TL;DR

This paper shows that prolonged reinforcement learning, combined with techniques such as KL regularization and reference policy resets, significantly improves the reasoning abilities of a small language model across diverse tasks. It highlights the importance of verifiable reward signals and training stability.

Contribution

The paper introduces training techniques such as KL regularization and reference policy resets, and reports substantial performance gains on reasoning tasks.

Findings

01. +14.7% on math tasks
02. +13.9% on coding tasks
03. +54.8% on logic puzzles

Abstract

Recent advancements in reasoning-focused language models such as OpenAI's O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled…
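
As a rough illustration of the ingredients named in the abstract, the following Python sketch shows a group-relative advantage in the style of GRPO, a KL penalty against a reference policy, and a periodic reference reset. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the simple log-ratio KL estimate, and the values of `beta` and `reset_every` are all hypothetical and are not taken from the paper.

```python
import copy
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against the other responses
    sampled for the same prompt (the "group")."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalized_score(advantage, logp_policy, logp_ref, beta=0.01):
    """Subtract a KL penalty from the advantage.

    Assumption: KL is approximated by the per-sample log-ratio
    log pi(y|x) - log pi_ref(y|x); beta is an arbitrary illustrative value.
    """
    kl_estimate = logp_policy - logp_ref
    return advantage - beta * kl_estimate


def maybe_reset_reference(step, policy_params, ref_params, reset_every=500):
    """Every `reset_every` steps (hypothetical schedule), replace the
    reference policy with a snapshot of the current policy."""
    if step > 0 and step % reset_every == 0:
        return copy.deepcopy(policy_params)
    return ref_params


# Verifiable rewards for four sampled answers to one prompt (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this toy setup, correct answers receive positive advantages and incorrect ones negative, and the penalty grows as the policy drifts from the reference. Resetting the reference to the current policy brings that penalty back to zero, which is one plausible reading of why resets could help over long training runs.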

Taxonomy

Topics: Artificial Intelligence in Law