Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

Zhibin Duan; Guowei Rong; Zhuo Li; Bo Chen; Mingyuan Zhou; Dandan Guo

arXiv:2602.10623·cs.LG·February 12, 2026

TL;DR

This paper introduces BNRM, a Bayesian non-negative reward model that enhances the robustness and interpretability of reward learning in LLMs by mitigating reward hacking and systematic biases.

Contribution

The paper proposes BNRM, a novel reward modeling framework combining non-negative factor analysis with preference modeling, enabling disentangled, debiased, and uncertainty-aware reward learning.

Findings

01. BNRM reduces reward over-optimization.

02. BNRM improves robustness under distribution shifts.

03. BNRM provides more interpretable reward decompositions.

Abstract

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF), yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust, uncertainty-aware reward learning. To…
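To make the Bradley-Terry component concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that a reward is the inner product of non-negative instance-specific factors `theta` and non-negative global weights `w`, with some weights set to zero to mimic sparsity. The function names, the softplus mapping, and the example numbers are all hypothetical; the Bayesian inference and uncertainty estimation described in the abstract are not modeled here.

```python
import math


def softplus(z):
    # Numerically stable softplus: maps any real number to a non-negative value.
    # Used here as one (assumed) way to keep latent factors non-negative.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def reward(theta, w):
    # Hypothetical reward: inner product of non-negative instance factors
    # and non-negative global weights. A zero weight switches a factor off.
    return sum(t * v for t, v in zip(theta, w))


def bt_prob(r_chosen, r_rejected):
    # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected).
    d = r_chosen - r_rejected
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def bt_nll(r_chosen, r_rejected):
    # Negative log-likelihood of one preference pair:
    # -log sigmoid(d) equals softplus(-d), which avoids log(0).
    return softplus(-(r_chosen - r_rejected))


# Illustrative numbers only (not from the paper).
w = [0.8, 0.0, 0.3]  # sparse global weights: the second factor is suppressed
theta_chosen = [softplus(z) for z in (1.0, 3.0, 0.5)]
theta_rejected = [softplus(z) for z in (0.2, 3.0, 0.1)]

r_c = reward(theta_chosen, w)
r_r = reward(theta_rejected, w)
print("P(chosen preferred):", bt_prob(r_c, r_r))
print("pairwise loss:", bt_nll(r_c, r_r))
```

In this toy setup the second factor takes a large value for both responses but has zero weight, so it cannot influence the preference probability. That is the intuition behind sparsity over global factors acting as a debiasing mechanism, under the assumptions stated above.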

Taxonomy

Topics: Topic Modeling · Recommender Systems and Techniques · Emotion and Mood Recognition