When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Shuning Shang; Hubert Strauss; Stanley Wei; Sanjeev Arora; Noam Razin

arXiv:2604.25872 · cs.LG · April 29, 2026

TL;DR

This paper shows that imperfect proxy rewards in reinforcement learning for language models can sometimes be beneficial, challenging the view that all reward errors are harmful. It also offers new reward evaluation metrics and insights for reward design.

Contribution

The work provides a theoretical categorization of reward errors according to their effect on the increase in ground truth reward under policy gradient, showing that some errors are benign or even beneficial. It also introduces improved reward evaluation metrics for RLHF.
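As background for the question of which outputs attract probability during policy gradient, a standard identity helps. The setting below is an assumed simplification (a single prompt, a finite set of outputs y, a tabular softmax policy with logits θ, a reward r, and exact gradients) and is not necessarily the exact model analyzed in the paper:

```latex
\pi_\theta(y) = \frac{\exp(\theta_y)}{\sum_{y'} \exp(\theta_{y'})},
\qquad
J(\theta) = \sum_{y} \pi_\theta(y)\, r(y)

\frac{\partial \pi_\theta(y')}{\partial \theta_y}
  = \pi_\theta(y')\bigl(\mathbb{1}[y' = y] - \pi_\theta(y)\bigr)

\frac{\partial J}{\partial \theta_y}
  = \sum_{y'} r(y')\, \pi_\theta(y')\bigl(\mathbb{1}[y' = y] - \pi_\theta(y)\bigr)
  = \pi_\theta(y)\bigl(r(y) - J(\theta)\bigr)
```

In this simplified setting, an output's logit rises only when its reward exceeds the current expected reward J(θ), and the push is scaled by the output's current probability. If the policy concentrates on an output with mediocre reward, J(θ) sits close to that reward, so a rarely sampled better output receives only a small gradient: one way to read the stalling around mediocre outputs that the abstract describes.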

Findings

1. Reward errors can be benign or even beneficial, not only harmful.

2. New reward model evaluation metrics correlate better with the performance of the trained language model.

3. Reward design insights that depend on how the reward interacts with the policy being trained and on the learning algorithm.

Abstract

Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement…
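To make the stalling mechanism concrete, here is a minimal sketch on a hypothetical three-output toy problem of our own (outputs bad, mediocre, good with ground truth rewards 0, 0.5, 1); it is not the paper's experimental setup. The policy is a tabular softmax trained by exact gradient ascent (not sampled REINFORCE), starting with 98% of its mass on the mediocre output. The hypothetical proxy reward makes a single error, rating the mediocre output as bad, so its pairwise ranking accuracy (computed here with ties counted as errors, one convention among several) is only 2/3. The names softmax, train, and ranking_accuracy are illustrative, not from the paper.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(probs, rewards):
    # J = sum_y pi(y) * r(y)
    return sum(p * r for p, r in zip(probs, rewards))


def train(rewards, logits0, steps=100, lr=1.0):
    # Exact policy gradient ascent on J for a tabular softmax policy,
    # using dJ/dtheta_y = pi(y) * (r(y) - J).
    logits = list(logits0)
    for _ in range(steps):
        pi = softmax(logits)
        j = expected_reward(pi, rewards)
        logits = [t + lr * p * (r - j) for t, p, r in zip(logits, pi, rewards)]
    return softmax(logits)


def ranking_accuracy(proxy, truth):
    # Fraction of output pairs with distinct ground truth rewards that the
    # proxy orders strictly correctly; proxy ties count as errors (an assumption).
    correct, total = 0, 0
    for i in range(len(truth)):
        for k in range(i + 1, len(truth)):
            if truth[i] == truth[k]:
                continue
            total += 1
            if (proxy[i] - proxy[k]) * (truth[i] - truth[k]) > 0:
                correct += 1
    return correct / total


# Hypothetical toy: outputs are [bad, mediocre, good].
r_true = [0.0, 0.5, 1.0]
r_proxy = [0.0, 0.0, 1.0]  # single error: the mediocre output is rated as bad
logits0 = [0.0, math.log(98.0), 0.0]  # initial policy: [0.01, 0.98, 0.01]

pi_true = train(r_true, logits0)
pi_proxy = train(r_proxy, logits0)

print("proxy ranking accuracy:", ranking_accuracy(r_proxy, r_true))
print("ground truth reward, trained on ground truth:", expected_reward(pi_true, r_true))
print("ground truth reward, trained on proxy:", expected_reward(pi_proxy, r_true))
```

In this toy, the policy trained on the ground truth reward barely moves in 100 steps: the good output's probability is tiny and the mediocre output's reward already matches the expected reward, so its logit gets almost no gradient. The proxy instead pushes probability off the mediocre output and ends with a much higher ground truth reward, despite its lower ranking accuracy. Other reward values, initializations, or step sizes can change the outcome, so treat this as an illustration of the mechanism rather than a general result.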