When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin

TL;DR
This paper analyzes how imperfect proxy rewards in reinforcement learning for language models can sometimes be beneficial, challenging the view that all reward errors are harmful, and offers new evaluation metrics and insights for reward design.
Contribution
The work provides a theoretical categorization of reward errors according to their effect on the increase in ground truth reward during policy gradient optimization. It shows that some errors are benign or even beneficial, and it introduces improved reward evaluation metrics for RLHF.
Findings
Reward errors can be benign or beneficial, not just harmful.
New metrics for reward model evaluation better correlate with language model performance.
Guidance for reward design depends on how the reward interacts with the policy and on the learning algorithm.
Abstract
Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality of proxy rewards, such as ranking accuracy, treat incorrect rewards as strictly harmful. In this work, however, we highlight that not all deviations from the ground truth are equal. By theoretically analyzing which outputs attract probability during policy gradient optimization, we categorize reward errors according to their effect on the increase in ground truth reward. The analysis establishes that reward errors, though conventionally viewed as harmful, can also be benign or even beneficial by preventing the policy from stalling around outputs with mediocre ground truth reward. We then present two practical implications of our theory. First, for reinforcement…
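The stalling effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's construction: the three-output softmax policy, the reward values, the initial logits, the step count, and all function names are assumptions chosen for illustration. It runs exact (expectation-form) policy gradient ascent twice from the same initial policy, once on the ground truth reward and once on a proxy that wrongly scores the mediocre output as low as the bad one, and then evaluates both resulting policies under the ground truth reward.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected(probs, reward):
    """Expected reward of a policy given as a probability list."""
    return sum(p * r for p, r in zip(probs, reward))

def train(init_logits, reward, steps=200, lr=1.0):
    """Exact policy gradient ascent for a softmax policy (no sampling)."""
    logits = list(init_logits)
    for _ in range(steps):
        probs = softmax(logits)
        baseline = expected(probs, reward)
        # For a softmax policy: dE[r]/dlogit_y = pi(y) * (r(y) - E_pi[r])
        logits = [t + lr * p * (r - baseline)
                  for t, p, r in zip(logits, probs, reward)]
    return softmax(logits)

def ranking_accuracy(proxy, truth):
    """Fraction of strictly ordered ground truth pairs that the proxy orders
    the same way; proxy ties count as errors (an assumed convention)."""
    n = len(truth)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if truth[i] != truth[j]]
    correct = sum((proxy[i] - proxy[j]) * (truth[i] - truth[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)

# Hypothetical outputs: index 0 = bad, 1 = mediocre, 2 = best.
GROUND_TRUTH = [0.0, 0.5, 1.0]
PROXY = [0.0, 0.0, 1.0]          # erroneous: mediocre is scored like bad
INIT_LOGITS = [2.0, 2.0, -2.0]   # the best output starts with ~1% probability

policy_gt = train(INIT_LOGITS, GROUND_TRUTH)
policy_proxy = train(INIT_LOGITS, PROXY)

# Both policies are evaluated under the ground truth reward.
value_gt = expected(policy_gt, GROUND_TRUTH)
value_proxy = expected(policy_proxy, GROUND_TRUTH)
proxy_accuracy = ranking_accuracy(PROXY, GROUND_TRUTH)

print(f"proxy ranking accuracy:          {proxy_accuracy:.3f}")
print(f"ground truth value, GT-trained:    {value_gt:.3f}")
print(f"ground truth value, proxy-trained: {value_proxy:.3f}")
```

In this setup the proxy has imperfect ranking accuracy, yet within the same step budget the proxy-trained policy ends with higher ground truth reward. Under the ground truth reward, the mediocre output initially attracts most of the probability mass and the policy lingers there; under the proxy, the mediocre output never attracts probability, so the best output takes over sooner. The effect depends on the initial policy and the step budget chosen here, so it should be read as one instance of a beneficial error rather than a general guarantee.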
