Reward Shaping to Mitigate Reward Hacking in RLHF

Jiayi Fu; Xuandong Zhao; Chengyuan Yao; Heng Wang; Qi Han; Yanghua Xiao

arXiv:2502.18770 · cs.LG · January 22, 2026

TL;DR

This paper introduces Preference As Reward (PAR), a novel reward shaping method for RLHF that reduces reward hacking, stabilizes training, and improves performance and data efficiency in aligning language models with human preferences.

Contribution

The paper systematically analyzes reward shaping principles and proposes PAR, leveraging latent preferences for improved stability, robustness, and efficiency in RLHF training.

Findings

1. PAR outperforms other reward shaping methods on benchmarks.

2. PAR achieves at least a 5% higher win rate on AlpacaEval 2.0 than competing approaches.

3. PAR remains robust against reward hacking after extensive training.

Abstract

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the…
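
The two design principles stated in the abstract (a bounded RL reward, and rapid initial growth followed by gradual convergence) can be illustrated with a sigmoid applied to a centered reward. The following is a minimal sketch, not the paper's implementation: the function name shaped_reward, the reference_reward argument, and the choice of a sigmoid are assumptions made for illustration, since the abstract above is truncated before it describes how PAR is computed.

```python
import math


def shaped_reward(raw_reward: float, reference_reward: float = 0.0) -> float:
    """Map an unbounded reward-model score into a bounded range.

    Hypothetical helper for illustration only. It applies a sigmoid to the
    raw score centered on a reference score, so the output stays within
    [0, 1] and changes fastest near the reference, flattening far above it.
    """
    centered = raw_reward - reference_reward
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if centered >= 0:
        return 1.0 / (1.0 + math.exp(-centered))
    e = math.exp(centered)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Print shaped values for a few raw scores to show the saturation.
    for raw in (-4.0, 0.0, 1.0, 4.0, 50.0):
        print(f"raw={raw:6.1f}  shaped={shaped_reward(raw):.4f}")
```

Under these assumptions, centering on a reference score places typical early-training scores on the steep part of the curve, so improvements register quickly, while very large raw scores saturate near 1 and give the policy little incentive to keep inflating the reward model's output.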

Code & Models

Repositories

poruna-byte/par · PyTorch · Official

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Reinforcement Learning in Robotics

Methods: Balanced Selection