TL;DR
This paper introduces DARS (Difficulty Adaptive Rollout Sampling), which allocates extra rollouts to difficult problems in RLVR, and combines it with larger training batches (breadth) to improve the reasoning performance of large language models.
Contribution
The paper proposes DARS and DARS-Breadth, two techniques that improve exploration and scaling in RLVR, and demonstrates that they boost reasoning capabilities.
Findings
DARS re-weights difficult problems to improve reasoning.
Scaling batch size increases breadth and boosts Pass@1.
Combining DARS with large breadth yields the best performance.
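The first finding can be illustrated with a small sketch. It assumes binary rewards and the usual group-normalized advantage, (r - mean) / std, computed over each problem's rollouts. Under that assumption the summed absolute advantage of a group peaks at 50% accuracy and shrinks as accuracy falls, so hard problems contribute less to the update. The `extra_rollouts` function is a hypothetical linear schedule written only for illustration; it is not the paper's re-balancing schedule, and all names and parameters here are assumptions.

```python
import math


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std over one problem's rollouts.

    Assumption: population std is used; implementations differ on this detail.
    Returns all zeros when every rollout has the same reward (std == 0).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def cumulative_abs_advantage(rewards):
    """Sum of |advantage| over the group, a rough proxy for the problem's weight."""
    return sum(abs(a) for a in group_advantages(rewards))


def extra_rollouts(accuracy, max_extra):
    """Hypothetical schedule: extra rollouts grow linearly as accuracy drops.

    This is an illustrative stand-in, not the schedule proposed in the paper.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.ceil(max_extra * (1.0 - accuracy))


if __name__ == "__main__":
    balanced = [1, 1, 1, 1, 0, 0, 0, 0]  # 50% accuracy
    hard = [1, 0, 0, 0, 0, 0, 0, 0]      # 12.5% accuracy
    print(cumulative_abs_advantage(balanced))
    print(cumulative_abs_advantage(hard))
    print(extra_rollouts(sum(hard) / len(hard), max_extra=16))
```

With binary rewards, N rollouts, and accuracy p, the summed absolute advantage works out to 2N·sqrt(p(1 - p)), which is why the hard group above receives less total weight than the balanced one despite having the same number of rollouts.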
Abstract
Reinforcement Learning with Verifiable Reward (RLVR) is a powerful method for enhancing the reasoning abilities of Large Language Models, but its full potential is limited by a lack of exploration in two key areas: Depth (the difficulty of problems) and Breadth (the number of training instances). Our analysis of the popular GRPO algorithm reveals a bias that down-weights difficult, low-accuracy problems, which are crucial for improving reasoning skills. To address this, we introduce Difficulty Adaptive Rollout Sampling (DARS), a method that re-weights difficult problems by using targeted, multi-stage rollouts. DARS increases the number of rollout outcomes for these harder problems according to our proposed re-balancing schedules and leads to consistent gains in Pass@K. We discovered that increasing rollout size alone does not improve performance and may actually impair it. In contrast,…
