Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, Chitta Baral

TL;DR
This paper shows that using diverse, self-generated data during mid-training enhances reinforcement learning in large language models, leading to better performance on reasoning and out-of-distribution tasks.
Contribution
It introduces a bootstrapped data-generation framework guided by Polya's problem-solving methods for improved RL in language models.
Findings
Models mid-trained on the self-generated data before RL outperform baselines on reasoning benchmarks.
Self-generated data increases model robustness on out-of-distribution tasks.
Theoretical analysis explains how diverse data influences policy-gradient updates.
Abstract
The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize…
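The bootstrapped generation step described in the abstract (sample several solution variants per question, keep only the ones that reach the correct answer, then fine-tune on them) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the strategy names are hypothetical stand-ins loosely inspired by heuristics from Polya's "How to Solve It", and `bootstrap_variants`, `toy_generate`, and `extract_answer` are invented for this example. In practice `generate` would call the model being trained.

```python
from typing import Callable, Dict, List

# Hypothetical strategy hints; the paper's actual prompts are not shown in this summary.
STRATEGIES = [
    "work backward",
    "solve a simpler related problem",
    "decompose into subgoals",
]


def extract_answer(solution: str) -> str:
    """Return the text after the last 'Answer:' marker (assumed output format)."""
    return solution.rsplit("Answer:", 1)[-1].strip()


def bootstrap_variants(
    question: str,
    reference_answer: str,
    generate: Callable[[str, int], str],
    extract: Callable[[str], str],
    samples_per_strategy: int = 2,
) -> List[Dict[str, str]]:
    """Collect distinct self-generated solutions that reach the reference answer."""
    kept: List[Dict[str, str]] = []
    seen = set()
    for strategy in STRATEGIES:
        for sample_index in range(samples_per_strategy):
            prompt = f"Strategy: {strategy}\nQuestion: {question}\nSolution:"
            solution = generate(prompt, sample_index)
            # Reject samples whose final answer is wrong.
            if extract(solution) != reference_answer:
                continue
            # Drop exact duplicates so the kept set stays diverse.
            if solution in seen:
                continue
            seen.add(solution)
            kept.append(
                {"question": question, "strategy": strategy, "solution": solution}
            )
    return kept


def toy_generate(prompt: str, sample_index: int) -> str:
    """Deterministic stand-in for a language model, used only for this demo."""
    if "work backward" in prompt:
        return "Undo the +3: 7 - 3. Answer: 4"
    if "simpler" in prompt:
        return "Guess from x + 1 = 6. Answer: 5"  # deliberately wrong
    return "Isolate x, then subtract 3 from 7. Answer: 4"


if __name__ == "__main__":
    data = bootstrap_variants(
        "What x satisfies x + 3 = 7?", "4", toy_generate, extract_answer
    )
    for row in data:
        print(row["strategy"], "->", row["solution"])
```

The kept rows would then serve as the fine-tuning set for the mid-training stage; filtering on the final answer is one simple correctness check, assumed here for illustration.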
