Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards

Ali Rad; Khashayar Filom; Darioush Keivan; Peyman Mohajerin Esfahani; Ehsan Kamalinejad

arXiv:2601.04411·cs.LG·January 9, 2026

Open Access

TL;DR

This paper models reinforcement learning with noisy verification as a multi-armed bandit problem, revealing a phase transition determined by Youden's index that dictates whether learning succeeds or fails in noisy environments.

Contribution

It introduces an analytically tractable bandit model for RLVR with noisy rewards, identifying a phase transition based on Youden's index that predicts learning success or collapse.

Findings

01. A sharp phase transition at J = 0, where J is Youden's index of the verifier, determines whether learning succeeds.

02. Within the learning regime, noise primarily affects the convergence rate, not the ultimate outcome.

03. The framework extends to analyzing RLVR stability and interventions.
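To make the threshold concrete, here is a minimal sketch of Youden's index for a binary verifier, using the standard definition J = TPR + TNR - 1 = 1 - FPR - FNR. The function names and the regime labels ("learning", "boundary", "collapse") are illustrative choices, not taken from the paper, and the mapping of J > 0 to learning and J < 0 to collapse is an assumption based on the summary above.

```python
def youden_index(fpr: float, fnr: float) -> float:
    """Youden's J = TPR + TNR - 1, written in terms of error rates."""
    if not (0.0 <= fpr <= 1.0 and 0.0 <= fnr <= 1.0):
        raise ValueError("fpr and fnr must lie in [0, 1]")
    return 1.0 - fpr - fnr


def expected_reward_gap(fpr: float, fnr: float) -> float:
    """E[reward | correct] - E[reward | incorrect] for a 0/1 verifier.

    A correct completion is accepted with probability 1 - fnr and an
    incorrect one with probability fpr, so the gap equals J.
    """
    return (1.0 - fnr) - fpr


def regime(fpr: float, fnr: float, tol: float = 1e-12) -> str:
    """Hypothetical labels for the sign of J (assumed mapping)."""
    j = youden_index(fpr, fnr)
    if j > tol:
        return "learning"
    if j < -tol:
        return "collapse"
    return "boundary"


if __name__ == "__main__":
    for fpr, fnr in [(0.1, 0.2), (0.5, 0.5), (0.6, 0.6)]:
        print(fpr, fnr, youden_index(fpr, fnr), regime(fpr, fnr))
```

Since the expected reward gap between correct and incorrect completions equals J, a verifier with J > 0 still favors correct completions on average, while one with J < 0 favors incorrect ones.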

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean--unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited--and this problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style…
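The sample, verify, update loop with a noisy verifier can be illustrated with a toy two-mode bandit. This sketch is not the paper's model: it assumes a single "correct" and a single "incorrect" mode, a sigmoid policy over one logit, 0/1 rewards flipped by false-positive and false-negative rates, and a simplified GRPO-style group-normalized advantage without clipping or a KL term. All names and hyperparameters (`simulate`, `group_size`, `lr`) are hypothetical.

```python
import math
import random


def simulate(fpr, fnr, steps=300, group_size=16, lr=0.5, seed=0):
    """Return the final probability of the correct mode in a toy bandit."""
    rng = random.Random(seed)
    theta = 0.0  # logit of the correct mode; start at p = 0.5
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-theta))
        # Sample a group of completions from the current policy.
        correct = [rng.random() < p for _ in range(group_size)]
        # Noisy verifier: accepts correct w.p. 1 - fnr, incorrect w.p. fpr.
        rewards = [
            float(rng.random() < ((1.0 - fnr) if c else fpr))
            for c in correct
        ]
        mean = sum(rewards) / group_size
        std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / group_size)
        if std == 0.0:
            continue  # identical rewards carry no learning signal
        grad = 0.0
        for c, r in zip(correct, rewards):
            adv = (r - mean) / std            # group-normalized advantage
            score = (1.0 - p) if c else -p    # d log pi(a) / d theta
            grad += adv * score
        theta += lr * grad / group_size
    return 1.0 / (1.0 + math.exp(-theta))


if __name__ == "__main__":
    print("J = +0.8:", simulate(fpr=0.1, fnr=0.1))
    print("J = -0.8:", simulate(fpr=0.9, fnr=0.9))
```

Under these assumptions, the expected update pushes probability toward the correct mode when 1 - FPR - FNR is positive and away from it when that quantity is negative, which is the qualitative behavior the rate-versus-fate question is about.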

Taxonomy

Topics: Reinforcement Learning in Robotics · Evolutionary Algorithms and Applications · Adversarial Robustness in Machine Learning