Selector-Guided Autonomous Curriculum for One-Shot Reinforcement Learning from Verifiable Rewards

Rudray Dave; Vedang Dubey; Smit Deoghare; Sudhakar Mishra

arXiv:2605.01823 · cs.LG · May 5, 2026

TL;DR

This paper introduces a learnable selector model for autonomous curriculum learning in one-shot reinforcement learning, improving reasoning accuracy by focusing on output disagreement over reward variance.

Contribution

The paper proposes Selector-Guided Autonomous Curriculum (SGAC), which uses a learnable selector over multiple features, especially output disagreement, to select training instances, and which outperforms heuristic-based selection methods.

Findings

01. SGAC achieves 68.0% accuracy on the Hendrycks MATH benchmark.

02. Output disagreement is a stronger predictor of reasoning gains than reward variance.

03. SGAC outperforms state-of-the-art models and previous RLVR checkpoints.

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has recently been established as a highly effective technique for improving the mathematical reasoning of Large Language Models (LLMs) using only a single training instance. Current state-of-the-art 1-shot RLVR methods select that instance with heuristics, mostly based on the historical variance of rewards, which we find to be an inherently misleading measure of transfer value. In this paper, we propose Selector-Guided Autonomous Curriculum (SGAC), which replaces the static reward-variance heuristic with a learnable selector model operating on a multi-dimensional feature space: success probability, reward variance, output disagreement (entropy), and semantic difficulty level. In our empirical evaluation on pools of candidate problems, we observed that output disagreement, rather than reward variance, is the…
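
To make three of the features named in the abstract concrete, the following is a minimal sketch of how they might be computed for one candidate problem from a batch of sampled rollouts. It is an illustration under stated assumptions, not the authors' implementation: the names `CandidateFeatures`, `answer_entropy`, and `rollout_features` are hypothetical, rewards are assumed to be binary (1 for a verified-correct answer, 0 otherwise), and output disagreement is assumed to be the Shannon entropy (in bits) of the empirical distribution over distinct final answers. Semantic difficulty and the learnable selector itself are omitted because the abstract does not specify them.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateFeatures:
    """Per-problem features (hypothetical container, not from the paper)."""
    success_prob: float         # fraction of rollouts with reward 1
    reward_variance: float      # population variance of the rewards
    output_disagreement: float  # entropy (bits) over distinct final answers


def answer_entropy(answers):
    """Shannon entropy in bits of the empirical final-answer distribution."""
    total = len(answers)
    counts = Counter(answers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy + 0.0  # normalizes -0.0 to 0.0 when all answers agree


def rollout_features(rewards, answers):
    """Compute features from one problem's rollouts.

    Assumes rewards[i] is the verifier's 0/1 reward for the rollout whose
    final answer is answers[i].
    """
    if not rewards or len(rewards) != len(answers):
        raise ValueError("rewards and answers must be non-empty and aligned")
    n = len(rewards)
    p = sum(rewards) / n
    variance = sum((r - p) ** 2 for r in rewards) / n
    return CandidateFeatures(p, variance, answer_entropy(answers))


# Two hypothetical problems with identical reward statistics.
consistent = rollout_features([1, 0, 0, 0], ["4", "5", "5", "5"])
scattered = rollout_features([1, 0, 0, 0], ["4", "5", "6", "7"])
```

In this toy case both problems have the same success probability and reward variance, yet their answer entropies differ, so a selector that only sees reward variance cannot tell them apart. This shows only that the two signals carry different information; it does not reproduce the paper's finding about which one predicts reasoning gains.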
