Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking
Paria Rashidinejad, Yuandong Tian

TL;DR
This paper introduces POWER-DL, a method for aligning AI systems with human preferences that mitigates reward hacking through robust reward maximization and dynamic label updates, with reported gains of up to 13.0 points on AlpacaEval 2.0.
Contribution
It proposes POWER, a new preference optimization approach with finite-sample guarantees, and a dynamic label technique to reduce reward hacking in offline preference optimization.
Findings
POWER-DL outperforms state-of-the-art methods on alignment benchmarks.
Empirical improvements of up to 13.0 points on AlpacaEval 2.0.
Theoretical guarantees support robustness against reward hacking.
Abstract
Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample…
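The statistical-fluctuation argument in the abstract can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not the paper's formal definitions or method: each response is modeled as an independent Bernoulli "win" with a hypothetical true preference probability `true_p`, and the function names `empirical_win_rate` and `misleading_fraction` are made up for illustration. A subpar response (`true_p` below 0.5) whose empirical win rate lands above 0.5 loosely corresponds to the Type I case, and a decent response (`true_p` above 0.5) whose empirical win rate lands below 0.5 loosely corresponds to the Type II case.

```python
import random


def empirical_win_rate(true_p, n, rng):
    """Fraction of n simulated comparisons in which the response is preferred."""
    wins = sum(rng.random() < true_p for _ in range(n))
    return wins / n


def misleading_fraction(true_p, n, trials=10_000, seed=0):
    """Share of simulated datasets whose empirical win rate falls on the
    wrong side of 0.5 relative to the (hypothetical) true preference."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        rate = empirical_win_rate(true_p, n, rng)
        looks_good_but_subpar = true_p < 0.5 and rate > 0.5   # Type I flavor
        looks_bad_but_decent = true_p > 0.5 and rate < 0.5    # Type II flavor
        if looks_good_but_subpar or looks_bad_but_decent:
            wrong += 1
    return wrong / trials


if __name__ == "__main__":
    for n in (1, 3, 51):
        type1 = misleading_fraction(0.3, n)
        type2 = misleading_fraction(0.7, n)
        print(f"n={n:>2}  subpar looks favorable: {type1:.3f}  "
              f"decent looks unfavorable: {type2:.3f}")
```

In this toy setting the misleading fraction shrinks as the number of comparisons per response grows, which is why sparsely covered responses in a preference dataset are the ones most exposed to both kinds of error.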
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Guidance and Control Systems · Robotic Path Planning Algorithms
Methods: Direct Preference Optimization
