TL;DR
This paper introduces Online Label Refinement (OLR), a method to improve reasoning model robustness under noisy supervision by progressively correcting labels based on rollout pass rates and historical consistency.
Contribution
It systematically analyzes noisy label mechanisms in RLVR and proposes OLR, a self-correcting approach that enhances model robustness across various reasoning benchmarks.
Findings
OLR improves robustness across in-distribution and out-of-distribution benchmarks.
OLR achieves average gains of 3.6% to 3.9% on in-distribution tasks.
OLR achieves 3.3% to 4.6% improvements on out-of-distribution evaluations.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models effectively but relies on abundant perfect labels, and its vulnerability to the noisy labels that expert scarcity makes unavoidable remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early…
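
The rollout-based condition and the inactive/active distinction described in the abstract can be illustrated with a small sketch. It assumes exact-match binary rewards and a GRPO-style group-normalized advantage; both are illustrative choices rather than details confirmed by the paper, and the names `binary_rewards`, `group_advantages`, and `label_status` are hypothetical.

```python
from statistics import mean, pstdev

def binary_rewards(answers, label):
    # Assumed verifier: reward 1.0 when a rollout's final answer matches the label.
    return [1.0 if a == label else 0.0 for a in answers]

def group_advantages(rewards, eps=1e-6):
    # Assumed GRPO-style normalization within one group of rollouts.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def label_status(answers, label):
    # A label is "active" if at least one rollout realizes it, else "inactive".
    pass_rate = mean(binary_rewards(answers, label))
    return ("active" if pass_rate > 0 else "inactive"), pass_rate

# Hypothetical rollouts for one prompt whose correct answer is "12".
rollouts = ["12", "12", "15", "12"]

# Inactive noisy label: no rollout matches, so every advantage is zero
# and the sample contributes no learning signal.
inactive_adv = group_advantages(binary_rewards(rollouts, "99"))

# Active noisy label: the one wrong rollout "15" matches the noisy label
# and receives a positive advantage, so the wrong answer is reinforced.
active_adv = group_advantages(binary_rewards(rollouts, "15"))

print(label_status(rollouts, "99"), inactive_adv)
print(label_status(rollouts, "15"), active_adv)
```

Under these assumptions, the inactive case wastes the sample, while the active case pushes the policy toward the incorrect answer and away from the three correct rollouts.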
