Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

Disha Singha

arXiv:2604.26360·cs.LG·April 30, 2026

TL;DR

This paper proposes an uncertainty-aware reward framework for reinforcement learning that models both epistemic uncertainty in value estimates and uncertainty in human preferences, which is reported to yield more stable training and less reward hacking.

Contribution

The paper introduces a dual-source uncertainty model combined with a confidence-adjusted Reliability Filter, aimed at improving alignment and robustness in RL under reward ambiguity.

Findings

01 Achieves a 93.7% reduction in reward hacking behavior.

02 Demonstrates stability and robustness across multiple environments.

03 Maintains competitive performance despite the added uncertainty modeling.

Abstract

Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and…
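
The two uncertainty signals named in the abstract can be illustrated with a minimal sketch. The abstract is truncated and does not give the form of the Reliability Filter, so everything below is an assumption for illustration only: the function names, the array shapes, the weights `alpha` and `beta`, and the use of a simple pessimistic penalty (mean value minus weighted uncertainty) in place of the paper's actual filter.

```python
import numpy as np


def dual_uncertainty(q_ensemble, reward_annotations):
    """Return (epistemic, preference) uncertainty per action.

    q_ensemble:         shape (K, A), value predictions from K ensemble members
    reward_annotations: shape (M, A), reward labels from M annotators
    Both shapes are hypothetical conventions chosen for this sketch.
    """
    # Ensemble disagreement: spread of value predictions across members.
    epistemic = np.std(q_ensemble, axis=0)
    # Annotation variability: spread of reward labels across annotators.
    preference = np.std(reward_annotations, axis=0)
    return epistemic, preference


def discounted_values(q_ensemble, reward_annotations, alpha=1.0, beta=1.0):
    """Penalize the ensemble-mean value by both uncertainty sources.

    This is a stand-in penalty rule, not the paper's Reliability Filter.
    """
    epistemic, preference = dual_uncertainty(q_ensemble, reward_annotations)
    q_mean = np.mean(q_ensemble, axis=0)
    return q_mean - alpha * epistemic - beta * preference


def select_action(q_ensemble, reward_annotations, alpha=1.0, beta=1.0):
    """Pick the action with the highest uncertainty-penalized value."""
    scores = discounted_values(q_ensemble, reward_annotations, alpha, beta)
    return int(np.argmax(scores))


if __name__ == "__main__":
    # Action 0: high mean value, but ensemble members and annotators disagree.
    # Action 1: slightly lower mean value, with full agreement.
    q = np.array([[10.0, 4.0], [0.0, 4.0], [5.0, 4.0]])
    labels = np.array([[1.0, 1.0], [0.0, 1.0]])
    print("greedy on mean:", int(np.argmax(q.mean(axis=0))))
    print("uncertainty-aware:", select_action(q, labels))
```

In this toy setup a purely greedy agent prefers the action whose value estimate is high but contested, while the penalized rule prefers the action on which the ensemble and the annotators agree. Setting `alpha` and `beta` to zero recovers greedy selection on the ensemble mean.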
