Inference-Time Reward Hacking in Large Language Models
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon

TL;DR
This paper investigates reward hacking in large language models during inference, characterizes its patterns, and proposes HedgeTune to mitigate it, improving alignment with user preferences and safety goals.
Contribution
It introduces the Best-of-Poisson (BoP) method for efficient reward approximation and the HedgeTune algorithm for reducing reward hacking during inference.
Findings
Under reward hacking, true performance follows a characteristic pattern: it first increases as the proxy reward is optimized, then declines once the proxy is overoptimized.
Hedging on proxy rewards can mitigate reward hacking effects.
HedgeTune improves reward-distortion tradeoffs across tasks.
Abstract
A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM's output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance, a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-n (BoN) and Soft Best-of-n (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of…
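The three inference-time schemes named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that BoN draws n candidates and keeps the one with the highest proxy reward, that SBoN picks among the n candidates with probability proportional to exp(reward / temperature), and that BoP runs BoN with a random n equal to one plus a Poisson draw with rate mu. The function names, the `temperature` and `mu` parameters, and the toy candidate pool are all hypothetical, and the paper may parametrize these methods differently.

```python
import math
import random


def best_of_n(candidates, proxy_reward, n, rng):
    """Draw n candidates uniformly and keep the one with the highest proxy reward."""
    draws = [rng.choice(candidates) for _ in range(n)]
    return max(draws, key=proxy_reward)


def soft_best_of_n(candidates, proxy_reward, n, temperature, rng):
    """Draw n candidates, then sample one with probability proportional to
    exp(reward / temperature). Assumed form: a small temperature approaches
    BoN, a large temperature approaches a plain draw from the base sampler."""
    draws = [rng.choice(candidates) for _ in range(n)]
    scores = [proxy_reward(c) / temperature for c in draws]
    top = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(draws, weights=weights, k=1)[0]


def sample_poisson(mu, rng):
    """Knuth's multiplication method for a Poisson(mu) draw (fine for small mu)."""
    threshold = math.exp(-mu)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def best_of_poisson(candidates, proxy_reward, mu, rng):
    """BoN with a random sample count. Assumption: n = 1 + Poisson(mu),
    so at least one candidate is always drawn."""
    n = 1 + sample_poisson(mu, rng)
    return best_of_n(candidates, proxy_reward, n, rng)
```

In this sketch, hedging corresponds to choosing a smaller n, a larger temperature, or a smaller mu, so that selection leans less heavily on a proxy reward that may be misspecified.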
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
