One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber

TL;DR
This paper investigates persistent biases in language reward models, identifies new bias types, and proposes a mechanistic reward shaping method that mitigates biases without harming reward quality, using minimal data.
Contribution
The paper systematically measures biases in five high-quality RMs, introduces a simple post-hoc intervention, and demonstrates bias mitigation with minimal labeled data.
Findings
Length, sycophancy, and overconfidence biases persist in high-quality RMs despite prior work on them.
New biases related to model-specific styles and answer order are identified.
Mechanistic reward shaping reduces biases without degrading reward quality.
Abstract
Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues persist despite prior work with respect to length, sycophancy, and overconfidence. We also discover new issues related to bias toward model-specific styles and answer-order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality and while using minimal labeled data. The method is extensible to new biases, model-internal, and generalizes…
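The abstract describes a post-hoc, model-internal intervention fitted on minimal labeled data but, as quoted here, does not spell out the procedure. The sketch below is therefore not the authors' method. It illustrates one generic technique of that kind, assumed purely for illustration: estimate a bias direction in hidden-state space as a difference of means over a small labeled set, then project that direction out before a reward head. The names `bias_direction`, `project_out`, and `reward`, the linear reward head, and the synthetic data are all hypothetical.

```python
import numpy as np

def bias_direction(h_biased, h_neutral):
    """Unit difference-of-means direction between two small labeled sets of hidden states."""
    d = h_biased.mean(axis=0) - h_neutral.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(h, direction):
    """Remove the component of each hidden state (row of h) along `direction`."""
    return h - np.outer(h @ direction, direction)

def reward(h, w):
    """Toy linear reward head; a hypothetical stand-in for an RM's final layer."""
    return h @ w

# Synthetic data (an assumption): the biased set matches the neutral set
# except for a spurious feature added along coordinate 0.
rng = np.random.default_rng(0)
dim = 16
spurious = np.zeros(dim)
spurious[0] = 1.0
h_neutral = rng.normal(size=(32, dim))
h_biased = h_neutral + 3.0 * spurious

# A reward head that (wrongly) rewards the spurious feature.
w = rng.normal(size=dim)
w[0] = 2.0

d = bias_direction(h_biased, h_neutral)

# Mean reward gap between biased and neutral inputs, before and after projection.
gap_before = (reward(h_biased, w) - reward(h_neutral, w)).mean()
gap_after = (reward(project_out(h_biased, d), w)
             - reward(project_out(h_neutral, d), w)).mean()
print(f"gap before: {gap_before:.3f}, gap after: {gap_after:.3f}")
```

In this toy setup the reward gap attributable to the spurious feature vanishes after projection, while every component orthogonal to the estimated direction is left untouched. Whether a real RM's biases are this linearly separable is an empirical question that the sketch does not address.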
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Recommender Systems and Techniques
