ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong

TL;DR
ProRL demonstrates that prolonged reinforcement learning can uncover new reasoning strategies in large language models, surpassing base model capabilities and expanding their reasoning boundaries through specific training techniques.
Contribution
The paper introduces ProRL, a novel RL training method with KL control and reference resets, showing it can enhance reasoning abilities beyond base models.
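To make the two named ingredients concrete, here is a minimal sketch of a KL-penalized reward together with a periodic reference reset. It is an illustration only, not the authors' implementation: the toy categorical policies, the names `kl_divergence`, `penalized_reward`, and `ReferencePolicy`, and the parameters `beta` and `reset_every` are all assumptions that do not come from the paper summary above.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions given as probability lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_reward(reward, policy, reference, beta):
    """Task reward minus a KL penalty that keeps the policy near the reference.

    `beta` is a hypothetical penalty coefficient chosen for illustration.
    """
    return reward - beta * kl_divergence(policy, reference)


class ReferencePolicy:
    """Holds the reference distribution and re-anchors it on a fixed schedule.

    Resetting every `reset_every` steps is an assumed schedule, used here
    only to show the mechanism.
    """

    def __init__(self, initial_probs, reset_every):
        self.probs = list(initial_probs)
        self.reset_every = reset_every

    def maybe_reset(self, step, current_policy):
        """Copy the current policy into the reference when a reset is due."""
        if step > 0 and step % self.reset_every == 0:
            self.probs = list(current_policy)
            return True
        return False
```

After a reset the KL term drops to zero, so the penalty stops pulling the policy back toward an outdated reference and training can keep moving from the new anchor.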
Findings
- RL-trained models outperform base models on pass@k evaluations
- ProRL uncovers reasoning strategies inaccessible to base models
- Reasoning improvements correlate with base model competence and training duration
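For readers unfamiliar with pass@k, the sketch below computes one commonly used estimator: given `n` sampled completions of which `c` are correct, it returns the probability that at least one of `k` samples drawn without replacement is correct. The summary above does not say which estimator the paper uses, so treat this as an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples of which 1 is correct, `pass_at_k(4, 1, 2)` evaluates to 1 - 3/6 = 0.5.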
Abstract
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a…
Code & Models
- 🤗 nvidia/Nemotron-Research-Reasoning-Qwen-1.5B · model · 2.8k dl · ♡ 241
- 🤗 Mungert/Nemotron-Research-Reasoning-Qwen-1.5B-GGUF · model · 86 dl · ♡ 1
- 🤗 QuantFactory/Nemotron-Research-Reasoning-Qwen-1.5B-GGUF · model · 143 dl · ♡ 3
- 🤗 ethan1115/Nemotron-Research-Reasoning-Qwen-1.5B · model
- 🤗 shizhediao2/Llama-Nemotron-8B-v1-Prorl · model
- 🤗 nvidia/Nemotron-Research-GooseReason-4B-Instruct · model · 474 dl · ♡ 7
- 🤗 daydreamwarrior/Nemotron-Research-GooseReason-4B-Instruct-heretic-v2 · model · 76 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Balanced Selection
