Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
Omar El Mansouri, Fathinah Asma Izzati, Mohamed El Amine Seddik, Salem Lahlou

TL;DR
This paper introduces a noise-robust variant of group-based policy optimization (GRPO and Dr.GRPO) that models reward corruption as Bernoulli noise, estimates the reward flip probabilities, and corrects for them to obtain provably unbiased gradient estimates when rewards from RLHF or RLVR are noisy.
Contribution
It proposes a noise correction method for group-based policy optimization that explicitly models reward noise, and supports it with a theoretical robustness analysis and empirical improvements under noisy rewards.
Findings
Accuracy gains of up to 6.7 percentage points on math tasks.
Consistent improvements across math and code tasks.
Theoretical proof of unbiased gradient estimation under reward noise.
Abstract
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code…
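The correction idea described in the abstract can be illustrated with a minimal sketch. It assumes binary rewards and known (already estimated) flip probabilities, and it uses the standard unbiased estimator for Bernoulli label noise; it is not taken from the paper, whose exact estimator and flip-probability estimation procedure are not given here. The names debias_rewards, group_advantages, rho0, and rho1 are hypothetical. If a true reward of 0 is observed as 1 with probability rho0 and a true reward of 1 is observed as 0 with probability rho1, then the expected observed reward is rho0 + r(1 - rho0 - rho1), so subtracting rho0 and dividing by (1 - rho0 - rho1) recovers r in expectation.

```python
import random
import statistics


def debias_rewards(noisy_rewards, rho0, rho1):
    """Unbiased estimate of clean binary rewards under Bernoulli flip noise.

    Assumed noise model (illustrative, not the paper's notation):
      rho0 = P(observed 1 | true 0), rho1 = P(observed 0 | true 1).
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("requires rho0 + rho1 < 1")
    return [(r - rho0) / denom for r in noisy_rewards]


def group_advantages(rewards, normalize_std=False, eps=1e-8):
    """Group-relative advantages: reward minus the group mean.

    normalize_std=True additionally divides by the group standard
    deviation (GRPO-style); False keeps mean-centering only.
    """
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]
    if normalize_std:
        std = statistics.pstdev(rewards)
        centered = [c / (std + eps) for c in centered]
    return centered


def flip(reward, rho0, rho1, rng):
    """Corrupt one binary reward according to the assumed noise model."""
    if reward == 1:
        return 0 if rng.random() < rho1 else 1
    return 1 if rng.random() < rho0 else 0


def mean_corrected_rewards(clean, rho0, rho1, trials=20000, seed=0):
    """Average the corrected rewards over many independent noise draws."""
    rng = random.Random(seed)
    totals = [0.0] * len(clean)
    for _ in range(trials):
        noisy = [flip(r, rho0, rho1, rng) for r in clean]
        for i, c in enumerate(debias_rewards(noisy, rho0, rho1)):
            totals[i] += c
    return [t / trials for t in totals]
```

Averaged over many noise draws, mean_corrected_rewards([1, 0, 0, 1], 0.1, 0.2) should approach the clean rewards, which is the sense in which the corrected signal is unbiased. Individual corrected rewards can fall outside [0, 1]; that is the price of unbiasedness and increases variance as rho0 + rho1 approaches 1.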
