Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking
Disha Singha

TL;DR
This paper proposes an uncertainty-aware reward framework for reinforcement learning that models both epistemic and preference uncertainty, leading to more stable training and reduced reward hacking.
Contribution
It introduces a dual-source uncertainty model combined with a confidence-adjusted filter to improve alignment and robustness in RL under reward ambiguity.
Findings
Achieves a 93.7% reduction in reward hacking behavior.
Demonstrates stability and robustness across multiple environments.
Maintains competitive performance despite uncertainty modeling.
Abstract
Reinforcement learning (RL) systems typically optimize scalar reward functions that assume precise and reliable evaluation of outcomes. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior. We introduce a dual-source uncertainty-aware reward framework that explicitly models both epistemic uncertainty in value estimation and uncertainty in human preferences. Model uncertainty is captured via ensemble disagreement over value predictions, while preference uncertainty is derived from variability in reward annotations. We combine these signals through a confidence-adjusted Reliability Filter that adaptively modulates action selection, encouraging a balance between exploitation and…
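The two uncertainty signals named in the abstract can be illustrated with a small sketch. The abstract does not give the functional form of the Reliability Filter, so everything below is an assumption made for illustration: the exponential confidence term, the weights `alpha` and `beta`, the restriction to non-negative value estimates, and all function names are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def ensemble_disagreement(q_values):
    """Epistemic proxy: std dev of ensemble value predictions, per action.

    q_values[m][a] is the value that ensemble member m assigns to action a.
    """
    n_actions = len(q_values[0])
    return [pstdev([member[a] for member in q_values]) for a in range(n_actions)]


def preference_uncertainty(annotations):
    """Preference proxy: std dev of reward annotations, per action.

    annotations[a] is the list of reward labels collected for action a.
    """
    return [pstdev(labels) for labels in annotations]


def reliability_scores(q_values, annotations, alpha=1.0, beta=1.0):
    """Discount each action's mean value by a confidence term in (0, 1].

    ASSUMPTION: confidence = exp(-(alpha * epistemic + beta * preference)).
    This form is hypothetical; the paper's actual filter may differ.
    ASSUMPTION: value estimates are non-negative, so that multiplying by a
    confidence below 1 always lowers the score.
    """
    n_actions = len(q_values[0])
    mean_q = [mean(member[a] for member in q_values) for a in range(n_actions)]
    epistemic = ensemble_disagreement(q_values)
    preference = preference_uncertainty(annotations)
    confidence = [
        math.exp(-(alpha * e + beta * p)) for e, p in zip(epistemic, preference)
    ]
    return [q * c for q, c in zip(mean_q, confidence)]


def select_action(q_values, annotations, alpha=1.0, beta=1.0):
    """Pick the action with the highest uncertainty-discounted score."""
    scores = reliability_scores(q_values, annotations, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data: action 1 has the higher mean value, but the ensemble and the
# annotators both disagree about it; action 0 is modest but agreed upon.
q_values = [[1.0, 5.0], [1.0, 1.0], [1.0, 3.0]]
annotations = [[1.0, 1.0], [0.0, 2.0]]
scores = reliability_scores(q_values, annotations)
chosen = select_action(q_values, annotations)
```

On this toy input, a greedy policy over the mean ensemble value would pick action 1, while the discounted scores favor action 0, because both uncertainty sources are zero for it. That is the qualitative behavior the abstract describes: high-value but poorly supported actions, which are typical of reward hacking, are down-weighted.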
