Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama

TL;DR
This paper introduces SignCert-PO, a method that reduces reward hacking in RLHF by certifying advantage sign robustness, improving policy outcomes without requiring multiple RMs or access to the RM training data.
Contribution
It proposes a novel, lightweight approach that certifies advantage sign robustness during policy optimization to mitigate reward hacking in RLHF.
Findings
SignCert-PO outperforms baselines on the TL;DR summarization and AlpacaFarm benchmarks.
It reduces reward hacking and improves win rates.
The method operates solely at the policy optimization stage, requiring only RM parameters.
Abstract
Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the…
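The abstract, as shown here, does not give the formula for the certified radius, so the following is only an illustrative sketch of the idea under strong simplifying assumptions: the reward is a linear head over fixed features, the perturbation is an L2 perturbation of that head alone, and the advantage uses a group-mean baseline. Under those assumptions the smallest perturbation that flips the sign of an advantage has a closed form, |A_i| / ||phi_i - mean(phi)||. The names sign_radius, robust_weights, and eps are hypothetical, and the clipping rule stands in for whatever weighting the paper actually uses.

```python
import numpy as np

def sign_radius(features, w):
    """Advantages and sign-preservation radii for a linear reward head.

    Assumption (not from the paper): r_i = features[i] @ w, and the
    advantage is r_i minus the group mean reward.
    """
    # Centering the features is equivalent to subtracting the mean reward.
    centered = features - features.mean(axis=0, keepdims=True)
    advantages = centered @ w
    norms = np.linalg.norm(centered, axis=1)
    # Distance from w to the hyperplane where advantage i equals zero.
    radius = np.abs(advantages) / np.maximum(norms, 1e-12)
    return advantages, radius

def robust_weights(radius, eps):
    """Hypothetical weighting: full weight once radius >= eps, linear below."""
    return np.clip(radius / eps, 0.0, 1.0)

# Toy group of three completions with 2-dimensional features.
features = np.array([[2.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
w = np.array([1.0, 0.5])

advantages, radius = sign_radius(features, w)
weights = robust_weights(radius, eps=1.0)

# Completions whose advantage sign is easy to flip contribute less
# to the policy gradient.
weighted_advantages = weights * advantages
print(radius, weights, weighted_advantages)
```

In this toy setup a completion whose advantage is small relative to how far its features sit from the group mean gets a small radius and is therefore down-weighted, which mirrors the "down-weighting non-robust completions" described in the abstract.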
