Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning
Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe

TL;DR
Goldilocks RL introduces a teacher-driven data sampling method that dynamically selects questions of appropriate difficulty, neither too easy nor too hard for the current model, to mitigate sparse rewards when training language models to reason with reinforcement learning.
Contribution
The paper presents Goldilocks, a novel adaptive sampling strategy in which a teacher model predicts each question's difficulty for the student model, making RL training more sample-efficient in large-scale language model settings.
Findings
Goldilocks sampling improves model performance on the OpenMathReasoning dataset.
Adaptive question difficulty selection accelerates learning in sparse reward settings.
The method outperforms standard training approaches under the same compute budget.
Abstract
Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while…
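The selection idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the excerpt does not state the teacher's architecture or the exact selection rule, so the code assumes the teacher outputs a predicted success probability p for each question and ranks questions by p(1 - p). That quantity is the variance of a binary reward, which is largest at p = 0.5 and vanishes for questions the student always or never solves. The names `goldilocks_scores`, `select_batch`, and `pred_success` are hypothetical.

```python
def goldilocks_scores(pred_success):
    """Score each question by the variance of a binary reward, p * (1 - p).

    Assumption: pred_success[i] is a teacher's estimate of the probability
    that the student solves question i. The score peaks at p = 0.5 and is
    zero for questions that are certainly solved or certainly failed.
    """
    return [p * (1.0 - p) for p in pred_success]


def select_batch(questions, pred_success, batch_size):
    """Return the batch_size questions with the highest Goldilocks score."""
    scores = goldilocks_scores(pred_success)
    # Sort indices by score, highest first; ties keep their original order.
    ranked = sorted(range(len(questions)), key=lambda i: scores[i], reverse=True)
    return [questions[i] for i in ranked[:batch_size]]


# Toy example with made-up predictions (illustrative values only).
questions = ["q_easy", "q_mid", "q_hard", "q_mid2"]
pred_success = [0.98, 0.55, 0.02, 0.40]
batch = select_batch(questions, pred_success, batch_size=2)
print(batch)
```

In this toy example the near-certain and near-impossible questions are skipped in favor of the two with intermediate predicted success. A real system would also need to refresh the teacher's predictions as the student improves, which this sketch leaves out.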
