Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

Paria Rashidinejad; Yuandong Tian

arXiv:2412.09544·cs.LG·December 13, 2024

Open Access

TL;DR

This paper introduces POWER-DL, a method for aligning AI systems with human preferences that mitigates reward hacking by combining robust reward maximization with dynamic label updates, with reported gains of up to 13.0 points on AlpacaEval 2.0.

Contribution

The paper proposes POWER, a new preference optimization method with finite-sample guarantees, together with a dynamic-label technique; both aim to reduce reward hacking in offline preference optimization.

Findings

01. POWER-DL outperforms state-of-the-art methods on alignment benchmarks.

02. POWER-DL achieves empirical improvements of up to 13.0 points on AlpacaEval 2.0.

03. Finite-sample theoretical guarantees for POWER support its robustness against reward hacking.

Abstract

Aligning AI systems with human preferences typically suffers from the infamous reward hacking problem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization, which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less favorable. We prove that many (mainstream or theoretical) preference optimization methods suffer from both types of reward hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference optimization method that combines Guiasu's weighted entropy with a robust reward maximization objective. POWER enjoys finite-sample…
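The two ideas named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The first function assumes each comparison in the dataset is an independent draw with a fixed true win probability. It computes how often a response wins a strict majority of its comparisons, which shows how a subpar choice can look favorable (Type I) or a decent choice unfavorable (Type II) when it is compared only a few times. The second function is the standard definition of Guiasu's weighted entropy, -Σ wᵢ pᵢ log pᵢ. How POWER combines it with the robust reward objective is not shown on this page and is not reproduced here. The names `prob_majority_wins` and `weighted_entropy` are hypothetical.

```python
import math


def prob_majority_wins(p_true: float, n: int) -> float:
    """Probability that a response with true win probability `p_true`
    wins a strict majority of `n` independent comparisons (binomial tail).
    Assumption: comparisons are i.i.d.; this is an illustration only."""
    return sum(
        math.comb(n, k) * p_true**k * (1.0 - p_true) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def weighted_entropy(probs, weights) -> float:
    """Guiasu's weighted entropy: -sum_i w_i * p_i * log(p_i), natural log.
    Terms with p_i == 0 contribute 0 by convention."""
    return -sum(w * p * math.log(p) for p, w in zip(probs, weights) if p > 0)


if __name__ == "__main__":
    # Type I flavour: a subpar response (true win rate 0.3) looks better
    # than its competitor far more often when it is rarely compared.
    for n in (1, 3, 11, 101):
        print(f"n={n:>3}  P(subpar looks favorable) = {prob_majority_wins(0.3, n):.4f}")

    # Type II flavour: for odd n, a decent response (true win rate 0.7)
    # loses the majority with the same probability as above, by symmetry.

    # With unit weights, weighted entropy reduces to Shannon entropy.
    print(weighted_entropy([0.5, 0.5], [1.0, 1.0]))
```

With few comparisons, the chance that a dataset misranks a response stays substantial. With many comparisons it shrinks toward zero. This is the poorly covered regime the abstract attributes reward hacking to.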

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Guidance and Control Systems · Robotic Path Planning Algorithms

Methods: Direct Preference Optimization