Can LLMs Learn to Reason Robustly under Noisy Supervision?

Shenzhi Yang; Guangcheng Zhu; Bowen Song; Sharon Li; Haobo Wang; Xing Zheng; Yingfan Ma; Zhongqi Chen; Weiqiang Wang; Gang Chen

arXiv:2604.03993·cs.LG·April 7, 2026

TL;DR

This paper introduces Online Label Refinement (OLR), a method that makes reasoning models more robust to noisy supervision by progressively correcting training labels, using rollout pass rates and historical consistency as the correction signals.

Contribution

The paper systematically analyzes how noisy labels act in RLVR and proposes OLR, a self-correcting approach that improves model robustness on both in-distribution and out-of-distribution reasoning benchmarks.
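To make the idea of label correction from pass rates and historical consistency concrete, here is a minimal sketch. It is not the authors' implementation: the `LabelRefiner` class, the majority-vote rule, the `pass_rate_threshold` and `history_len` parameters, and their default values are all assumptions introduced for illustration. The sketch replaces a label only when the same majority answer clears the pass-rate threshold for several consecutive updates.

```python
from collections import Counter, deque
from typing import Deque, Dict, Optional, Sequence


class LabelRefiner:
    """Hypothetical sketch of online label refinement (not the paper's code)."""

    def __init__(self, pass_rate_threshold: float = 0.6, history_len: int = 3):
        # Both values are illustrative assumptions, not taken from the paper.
        self.pass_rate_threshold = pass_rate_threshold
        self.history_len = history_len
        self.history: Dict[str, Deque[Optional[str]]] = {}

    def update(self, sample_id: str, label: str, rollout_answers: Sequence[str]) -> str:
        """Return the label to train on after seeing this step's rollouts."""
        if not rollout_answers:
            return label
        # Majority answer among the rollouts and the fraction that produced it.
        majority, count = Counter(rollout_answers).most_common(1)[0]
        pass_rate = count / len(rollout_answers)
        hist = self.history.setdefault(sample_id, deque(maxlen=self.history_len))
        # Record the majority answer only if it is confident enough this step.
        hist.append(majority if pass_rate >= self.pass_rate_threshold else None)
        # Historical consistency: the same confident majority in every recent step.
        if len(hist) == self.history_len and all(h == majority for h in hist):
            return majority
        return label


refiner = LabelRefiner(pass_rate_threshold=0.6, history_len=2)
first = refiner.update("q1", "7", ["4", "4", "4", "5"])   # too little history yet
second = refiner.update("q1", "7", ["4", "4", "4", "5"])  # consistent twice
```

In this toy run the given label "7" is kept after the first update and replaced by the majority answer "4" after the second, because only then has the same answer been confidently dominant for `history_len` consecutive steps.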

Findings

01 OLR improves robustness across in-distribution and out-of-distribution benchmarks.

02 OLR achieves average gains of 3.6% to 3.9% on in-distribution tasks.

03 OLR achieves 3.3% to 4.6% improvements on out-of-distribution evaluations.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early…
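The distinction between inactive and active noisy labels can be illustrated with a small sketch. It rests on one reading of the abstract: a noisy label is treated as active when at least one rollout from the current policy reproduces it (so it earns reward and can be reinforced), and inactive when no rollout does. The function names and the exact-match comparison are assumptions made for illustration; the ground-truth answer is used here only to label the example and would not be available during real training.

```python
from typing import Sequence


def label_pass_rate(rollout_answers: Sequence[str], label: str) -> float:
    """Fraction of rollouts whose final answer matches the given label."""
    if not rollout_answers:
        return 0.0
    return sum(a == label for a in rollout_answers) / len(rollout_answers)


def classify_label(rollout_answers: Sequence[str], given_label: str, true_label: str) -> str:
    """Classify a training label as 'clean', 'active_noise', or 'inactive_noise'.

    Assumed reading: a wrong label is 'active' if the current policy can
    produce it in at least one rollout, otherwise 'inactive'.
    """
    if given_label == true_label:
        return "clean"
    if label_pass_rate(rollout_answers, given_label) > 0.0:
        return "active_noise"    # the wrong answer gets rewarded and reinforced
    return "inactive_noise"      # no rollout is rewarded, so the sample is wasted


rollouts = ["4", "5", "4", "4"]
active = classify_label(rollouts, given_label="5", true_label="4")
inactive = classify_label(rollouts, given_label="7", true_label="4")
```

Under these assumptions the label "5" is active noise, since one of the four rollouts matches it, while "7" is inactive noise, since no rollout ever produces it and the sample contributes no reward signal.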

Code & Models

Repositories

shenzhiyang2000/OLR (GitHub)
