Reward Shaping to Mitigate Reward Hacking in RLHF
Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao

TL;DR
This paper introduces Preference As Reward (PAR), a novel reward shaping method for RLHF that reduces reward hacking, stabilizes training, and improves performance and data efficiency in aligning language models with human preferences.
Contribution
The paper systematically analyzes reward shaping principles and proposes PAR, leveraging latent preferences for improved stability, robustness, and efficiency in RLHF training.
Findings
PAR outperforms other reward shaping methods on benchmarks.
PAR achieves a win rate at least 5 percentage points higher than competing approaches on AlpacaEval 2.0.
PAR maintains robustness against reward hacking after extensive training.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the…
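The two design principles named in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: it assumes a logistic sigmoid applied to the proxy reward minus a reference reward, which is one shaping function that is bounded and grows fastest near the reference point before flattening out. The names `stable_sigmoid`, `shaped_reward`, `proxy_reward`, and `reference_reward` are hypothetical, and the truncated abstract above does not confirm that PAR uses exactly this form.

```python
import math


def stable_sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow in math.exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def shaped_reward(proxy_reward: float, reference_reward: float) -> float:
    """Illustrative shaped RL reward (an assumption, not the paper's stated formula).

    Principle 1 (bounded): the output always lies in [0, 1].
    Principle 2 (rapid initial growth, gradual convergence): the slope is
    largest when the proxy reward is near the reference reward and decays
    toward zero as the proxy reward moves far above it.
    """
    centered = proxy_reward - reference_reward
    return stable_sigmoid(centered)


if __name__ == "__main__":
    # Compare the gain from one extra unit of proxy reward near the
    # reference point with the gain far above it.
    early_gain = shaped_reward(1.0, 0.0) - shaped_reward(0.0, 0.0)
    late_gain = shaped_reward(5.0, 0.0) - shaped_reward(4.0, 0.0)
    print(f"gain near reference: {early_gain:.4f}")
    print(f"gain far above reference: {late_gain:.4f}")
```

Under this sketch, a response whose proxy reward equals the reference reward receives a shaped reward of 0.5, and pushing the proxy reward ever higher yields diminishing returns, which is the property that would limit the payoff from exploiting flaws in the reward model.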
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Reinforcement Learning in Robotics
Methods: Balanced Selection
