Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

TL;DR
This paper introduces BNRM, a Bayesian non-negative reward model that enhances the robustness and interpretability of reward learning in LLMs by mitigating reward hacking and systematic biases.
Contribution
The paper proposes BNRM, a novel reward modeling framework combining non-negative factor analysis with preference modeling, enabling disentangled, debiased, and uncertainty-aware reward learning.
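To make the combination concrete, here is a minimal sketch of a Bradley-Terry preference loss over a reward built from non-negative factors. It is an illustration under stated assumptions, not the paper's implementation: the softplus parameterization and the names `reward`, `bt_loss`, `W_enc`, and `loadings` are hypothetical, and the sketch omits the Bayesian inference and the sparsity prior entirely.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)); maps any real input to a value >= 0.
    return np.logaddexp(0.0, x)

def reward(features, W_enc, loadings):
    """Scalar reward as a non-negative combination of non-negative factors.

    features: (n, d) response embeddings (assumed to be given)
    W_enc:    (d, k) hypothetical encoder weights
    loadings: (k,)   hypothetical unconstrained global factor weights
    """
    theta = softplus(features @ W_enc)  # instance-specific latent factors, >= 0
    phi = softplus(loadings)            # global factor weights, >= 0
    return theta @ phi                  # shape (n,)

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood:
    # -log sigmoid(r_c - r_r) = softplus(-(r_c - r_r)), averaged over pairs.
    return float(np.mean(softplus(-(r_chosen - r_rejected))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(8, 4))
    loadings = rng.normal(size=4)
    chosen = rng.normal(size=(5, 8))
    rejected = rng.normal(size=(5, 8))
    loss = bt_loss(reward(chosen, W_enc, loadings),
                   reward(rejected, W_enc, loadings))
    print(f"BT loss on random pairs: {loss:.4f}")
```

Because every factor and every weight is non-negative, each factor can only add to the reward, which is what makes a per-factor decomposition readable. How BNRM actually places priors on these quantities is not specified in the text above.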
Findings
BNRM reduces reward over-optimization.
BNRM improves robustness under distribution shifts.
BNRM provides more interpretable reward decompositions.
Abstract
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust, uncertainty-aware reward learning. To…
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Emotion and Mood Recognition
