TL;DR
NudgeRL is a strategy-guided exploration framework for RLVR that improves the reasoning capabilities of language models by promoting diverse trajectories without expensive oracle supervision, outperforming standard GRPO and oracle-guided RL baselines.
Contribution
The paper proposes a novel structured exploration method, Strategy Nudging, and a unified learning objective to improve RLVR efficiency and diversity, surpassing existing baselines.
Findings
NudgeRL outperforms standard GRPO even when GRPO is given up to 8x larger rollout budgets.
It surpasses oracle-guided RL baselines on five math benchmarks.
Structured exploration can replace brute-force scaling and privileged information methods.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose…
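To make the idea of conditioning each rollout on a strategy-level context concrete, here is a minimal sketch under stated assumptions. The strategy hints, the prompt template, and the function names (`STRATEGIES`, `build_nudged_prompts`, `group_normalized_advantages`) are hypothetical illustrations and are not taken from the paper. The advantage computation is the standard GRPO-style group normalization, shown only as a baseline; it is not the learning objective the abstract goes on to propose, which is not described here.

```python
from statistics import mean, pstdev

# Hypothetical strategy hints (assumption): the paper's actual
# strategy-level contexts are not specified in this summary.
STRATEGIES = [
    "Work backwards from the desired result.",
    "Try small cases and look for a pattern.",
    "Set up equations and solve algebraically.",
    "Consider a proof by contradiction.",
]


def build_nudged_prompts(question, strategies, group_size):
    """Return one prompt per rollout, cycling through the strategy hints.

    The prompt template is an assumption made for illustration.
    """
    prompts = []
    for i in range(group_size):
        hint = strategies[i % len(strategies)]
        prompts.append(f"Strategy hint: {hint}\n\nProblem: {question}")
    return prompts


def group_normalized_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: (reward - group mean) / group std.

    eps avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    prompts = build_nudged_prompts("Find all integers n with n^2 = n.", STRATEGIES, 6)
    for p in prompts:
        print(p.splitlines()[0])
    # Verifiable rewards are typically binary (1 = correct, 0 = incorrect).
    print(group_normalized_advantages([1, 0, 1, 0, 0, 0]))
```

In this sketch each rollout in a group sees a different hint, so the sampled trajectories are pushed toward different solution approaches instead of relying only on sampling temperature for diversity. Note that when all rollouts in a group receive the same reward, the group-normalized advantages are all zero and the group contributes no learning signal, which is one way to see why exploration limits what the policy can learn.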
