Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards
Rudray Dave, Vedang Dubey, Smit Deoghare, Sudhakar Mishra

TL;DR
This paper introduces a learnable selector model for autonomous curriculum learning in one-shot reinforcement learning from verifiable rewards. It improves reasoning accuracy by prioritizing output disagreement, rather than reward variance, when choosing training instances.
Contribution
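It proposes Selector-Guided Autonomous Curriculum (SGAC), a learned selector that scores candidate training instances on several features (success probability, reward variance, output disagreement, and semantic difficulty), with output disagreement the most informative, and that outperforms heuristic-based selection.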
Findings
SGAC achieves 68.0% accuracy on the Hendrycks MATH benchmark.
Output disagreement is a stronger predictor of reasoning gains than reward variance.
SGAC outperforms state-of-the-art models and previous RLVR checkpoints.
Abstract
Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for improving the mathematical reasoning skills of Large Language Models (LLMs) using only a single training instance. Current state-of-the-art 1-shot RLVR methods select instances with heuristics, mostly based on historical reward variance, which we find to be an inherently misleading measure of an instance's transferability. In this paper, we propose a Selector-Guided Autonomous Curriculum (SGAC) approach, which replaces the static reward-variance heuristic with a learnable selector model operating on a multi-dimensional feature space consisting of success probability, reward variance, output disagreement (entropy), and semantic difficulty. In our empirical evaluation on pools of candidate problems, we observed that output disagreement, rather than reward variance, is the…
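To make the distinction between reward variance and output disagreement concrete, the following is a minimal sketch of how three of the features named in the abstract might be computed from sampled rollouts for one candidate problem. The paper's exact definitions are not given in this excerpt, so everything here is an assumption: the function name `rollout_features` is hypothetical, rewards are taken to be binary (1 for a correct final answer, 0 otherwise), and disagreement is taken to be the Shannon entropy of the empirical distribution over final answers. Semantic difficulty and the learned selector itself are not shown.

```python
import math
from collections import Counter


def rollout_features(answers, correct_answer):
    """Compute per-problem features from sampled final answers.

    Hypothetical helper for illustration only; not the authors' code.
    answers: list of final answers (strings) sampled from the policy.
    correct_answer: the verifiable reference answer.
    """
    n = len(answers)
    # Assumed binary verifiable reward: 1 if the final answer matches.
    rewards = [1.0 if a == correct_answer else 0.0 for a in answers]
    success_prob = sum(rewards) / n
    # Population variance of the rewards (equals p * (1 - p) for binary rewards).
    reward_variance = sum((r - success_prob) ** 2 for r in rewards) / n
    # Assumed disagreement measure: entropy (in nats) over distinct final answers.
    counts = Counter(answers)
    disagreement = -sum((c / n) * math.log(c / n) for c in counts.values())
    return {
        "success_prob": success_prob,
        "reward_variance": reward_variance,
        "disagreement": disagreement,
    }


# Two toy problems with the same success rate, hence the same reward variance.
a = rollout_features(["4", "4", "5", "5"], correct_answer="4")
b = rollout_features(["4", "4", "5", "6"], correct_answer="4")
print(a)
print(b)
```

In this toy case both problems have a success probability of 0.5 and therefore identical reward variance, but the second spreads its wrong answers over more distinct outputs and so has higher entropy. Under these assumptions, a variance-only heuristic cannot tell the two problems apart, while a disagreement feature can.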
