Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

Omar El Mansouri; Fathinah Asma Izzati; Mohamed El Amine Seddik; Salem Lahlou

arXiv:2510.18924·cs.LG·May 20, 2026

TL;DR

This paper introduces a noise-robust framework for group-based policy optimization (GRPO and Dr.GRPO) that models reward corruption as Bernoulli noise. After estimating the reward flip probabilities, it corrects the learning signal to obtain provably unbiased gradient estimates, which improves performance when RLHF or RLVR rewards are noisy.

Contribution

The paper proposes a noise correction method for group-based policy optimization that explicitly models reward noise, and it supports the method with both a theoretical robustness analysis and empirical improvements.

Findings

01

Accuracy gains of up to 6.7 percentage points on math tasks.

02

Consistent improvements across math and code tasks.

03

Theoretical proof of unbiased gradient estimation under reward noise.

Abstract

Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code…
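The correction step described in the abstract can be illustrated with a minimal sketch. It assumes binary rewards and two known flip probabilities, `rho0` (a true 0 observed as 1) and `rho1` (a true 1 observed as 0). Under that assumption the noisy reward has expectation rho0 + r(1 - rho0 - rho1), so (observed - rho0) / (1 - rho0 - rho1) is an unbiased estimate of the true reward r. This is the standard debiasing for label-flip noise, not necessarily the exact estimator used in the paper, and all function names below are hypothetical. The sketch uses mean-centered (Dr.GRPO-style) advantages only, because mean-centering is linear and therefore preserves unbiasedness; dividing by the group standard deviation is nonlinear, so this simple argument does not carry over to it directly.

```python
import random

def debias_reward(observed, rho0, rho1):
    """Unbiased estimate of a binary true reward from one noisy observation.

    Assumed noise model (illustrative, not taken from the paper's text):
      rho0 = P(observed = 1 | true = 0)
      rho1 = P(observed = 0 | true = 1)
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0:
        raise ValueError("correction requires rho0 + rho1 < 1")
    return (observed - rho0) / denom

def centered_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def flip(true_reward, rho0, rho1, rng):
    """Corrupt a binary reward with asymmetric Bernoulli flip noise."""
    if true_reward == 1:
        return 0 if rng.random() < rho1 else 1
    return 1 if rng.random() < rho0 else 0

# Monte Carlo check: averaged over many noisy draws, the corrected
# advantages should approach the advantages computed from clean rewards.
rng = random.Random(0)
rho0, rho1 = 0.1, 0.3
true_group = [1, 0, 1, 1, 0, 0, 1, 0]  # one group of sampled completions
n_trials = 20000

avg_corrected = [0.0] * len(true_group)
for _ in range(n_trials):
    noisy = [flip(r, rho0, rho1, rng) for r in true_group]
    corrected = [debias_reward(o, rho0, rho1) for o in noisy]
    for i, a in enumerate(centered_advantages(corrected)):
        avg_corrected[i] += a / n_trials

clean = centered_advantages(true_group)
max_gap = max(abs(a - c) for a, c in zip(avg_corrected, clean))
print(f"largest gap between averaged corrected and clean advantages: {max_gap:.4f}")
```

In practice the flip probabilities are not known and must be estimated first, as the abstract notes; the sketch takes them as given to isolate the correction itself.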
