TL;DR
Miner uses the policy's intrinsic uncertainty as a self-supervised reward to make reinforcement learning for large reasoning models more data-efficient, achieving state-of-the-art results without external supervision.
Contribution
Introduces a simple, effective method that uses intrinsic uncertainty for reward signals, with novel token-level credit assignment and adaptive advantage calibration.
Findings
Achieves an absolute gain of up to 4.58 in Pass@1 over previous methods.
Outperforms other exploration-focused algorithms on six reasoning benchmarks.
Demonstrates that exploiting latent uncertainty is key to scalable RL in reasoning models.
Abstract
Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner), that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner…
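The zero-advantage problem described in the abstract can be illustrated with a minimal sketch. It assumes a GRPO-style group-normalized advantage, (reward minus group mean) divided by group standard deviation; the abstract says only "critic-free RL", so this estimator, the function name `group_relative_advantages`, and the `eps` parameter are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical group-normalized advantage: (r - mean) / (std + eps).

    Assumed estimator for illustration only; not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed prompt: some rollouts correct (1.0), some wrong (0.0).
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Positive homogeneous prompt: every rollout is correct.
homogeneous = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)        # non-zero advantages, so the rollouts carry a gradient signal
print(homogeneous)  # all zeros, so the rollouts contribute no gradient
```

Under this assumption, every rollout of a positive homogeneous prompt gets an advantage of exactly zero. This is the waste that an additional intrinsic reward signal is meant to recover.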
