From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

Zhuohao Yu; Zhiwei Steven Wu; Adam Block

arXiv:2604.04648 · cs.LG · April 7, 2026

TL;DR

This paper introduces a cautious approach using lower confidence bounds to mitigate reward hacking in best-of-N sampling for language models, improving robustness and performance.
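As a rough illustration of selecting by a lower confidence bound rather than by raw reward, the sketch below subtracts an uncertainty penalty from each candidate's reward before taking the maximum. The function name `select_pessimistic`, the `penalty` weight, and the per-candidate `uncertainties` are hypothetical; this summary does not say how the paper estimates uncertainty, so the values here are simply supplied by hand.

```python
from typing import Sequence


def select_pessimistic(rewards: Sequence[float],
                       uncertainties: Sequence[float],
                       penalty: float = 1.0) -> int:
    """Return the index of the candidate with the highest lower confidence bound.

    Assumption: the bound is reward - penalty * uncertainty. The paper's
    exact form of the bound is not given in this summary.
    """
    if len(rewards) == 0 or len(rewards) != len(uncertainties):
        raise ValueError("rewards and uncertainties must be non-empty and equal length")
    lcb = [r - penalty * u for r, u in zip(rewards, uncertainties)]
    return max(range(len(lcb)), key=lcb.__getitem__)


# Hand-picked toy numbers: candidate 2 has the highest raw reward,
# but the reward model is very unsure about it.
rewards = [0.9, 0.7, 0.95]
uncertainties = [0.1, 0.05, 0.6]

greedy_choice = max(range(len(rewards)), key=rewards.__getitem__)
cautious_choice = select_pessimistic(rewards, uncertainties, penalty=1.0)
```

With these numbers the greedy rule picks the high-reward, high-uncertainty candidate, while the penalized rule falls back to a candidate whose score the reward model is more confident about. Setting `penalty=0.0` recovers the greedy rule.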

Contribution

It proposes a novel caution method based on pessimism principles, effectively reducing reward hacking and outperforming prior mitigation techniques.

Findings

01. Caution significantly reduces reward hacking in BoN sampling.

02. The approach is computationally efficient and easy to implement.

03. Theoretical analysis confirms the method's superiority in a simplified setting.

Abstract

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated, scored according to a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases because the selected responses exploit imperfections in the reward model instead of genuinely improving generation quality. Prior attempts to mitigate reward hacking, via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit additional compute. In this work, we explore the principle of pessimism in RL,…
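The Best-of-N procedure described in the abstract (generate N candidates, score each with a reward model, keep the top scorer) can be sketched as follows. This is a minimal sketch, not the authors' code: `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical names, and the toy reward deliberately scores responses by length to mimic an imperfect proxy.

```python
import random
from typing import Callable, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int) -> Tuple[str, float]:
    """Sample n responses and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(reward_model(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Pretend "generation": append between 1 and 20 filler characters.
    return prompt + " " + "x" * rng.randint(1, 20)


def toy_reward(prompt: str, response: str) -> float:
    # Imperfect proxy: longer responses score higher regardless of quality.
    return float(len(response))


response, score = best_of_n("Q:", toy_generate, toy_reward, n=8)
```

Because the toy reward only measures length, raising `n` makes it more likely that the selected response is simply the longest one sampled, which is a small-scale picture of the over-optimization the abstract calls reward hacking.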

Videos

From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism (SlidesLive)