An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
Andreas Plesner, Francisco Guzmán, Anish Athalye

TL;DR
This paper shows that Reinforcement Learning with Verifiable Rewards (RLVR) tolerates noisy, imperfect verifiers: with verification noise rates up to 15%, peak validation accuracy stays within 2 percentage points of the clean baseline.
Contribution
It provides empirical evidence that RLVR remains effective despite verifier inaccuracies, challenging the need for perfect verification in training large language models.
Findings
Noise rates up to 15% in verification yield peak validation accuracy within 2 percentage points of the clean baseline.
Results are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B.
Moderately accurate, high-precision verification is preferable to perfect but costly checks.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a…
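As a rough illustration of what "introducing noise into RL training" can mean for a binary verifier, here is a minimal sketch in Python. It assumes the simplest controlled noise model, in which each pass/fail reward is flipped independently with a fixed probability. The excerpt above does not specify the paper's actual noise procedure, so the function names `noisy_reward` and `observed_noise_rate` and the symmetric-flip model itself are hypothetical.

```python
import random


def noisy_reward(true_reward: int, noise_rate: float, rng: random.Random) -> int:
    """Return a binary reward, flipped with probability `noise_rate`.

    Assumption: symmetric, independent flips. This is an illustrative noise
    model, not necessarily the one used in the paper.
    """
    if true_reward not in (0, 1):
        raise ValueError("true_reward must be 0 or 1")
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    if rng.random() < noise_rate:
        return 1 - true_reward  # the verifier reports the wrong outcome
    return true_reward


def observed_noise_rate(true_rewards, noise_rate: float, seed: int = 0) -> float:
    """Fraction of rewards that end up flipped for a given noise rate."""
    rng = random.Random(seed)
    flipped = sum(noisy_reward(r, noise_rate, rng) != r for r in true_rewards)
    return flipped / len(true_rewards)


if __name__ == "__main__":
    # Hypothetical clean rewards: alternating pass/fail outcomes.
    clean = [i % 2 for i in range(20000)]
    print(f"empirical flip rate: {observed_noise_rate(clean, 0.15):.3f}")
```

In a training loop, a wrapper like `noisy_reward` would sit between the clean verifier and the policy update, so the policy only ever sees the corrupted signal while evaluation still uses the clean verifier.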
