Learning a Pessimistic Reward Model in RLHF

Yinglun Xu; Hangoo Kang; Tarun Suresh; Yuxuan Wan; Gagandeep Singh

arXiv:2505.20556 · cs.LG · May 28, 2025

TL;DR

This paper introduces PET, a new method for fine-tuning reward models in RLHF that prevents reward hacking by adopting a pessimistic approach, enabling high-quality policy learning without regularization.

Contribution

The paper presents PET, a novel pessimistic reward fine-tuning technique that effectively mitigates reward hacking without regularization in offline RLHF settings.

Findings

01. High-quality policies can be learned without regularization.

02. Pessimistic reward models prevent reward hacking.

03. Policies with high KL divergence can still perform well.

Abstract

This work proposes PET, a novel pessimistic reward fine-tuning method that learns a pessimistic reward model robust to reward hacking in offline reinforcement learning from human feedback (RLHF). Traditional reward modeling techniques in RLHF train an imperfect reward model, so KL regularization plays a pivotal role in mitigating reward hacking when a policy is optimized against that model. Such an intuition-based method still suffers from reward hacking, and it excludes policies with large KL divergence from the dataset distribution during learning. In contrast, we show that when a policy is optimized on a pessimistic reward model fine-tuned through PET, reward hacking can be prevented without relying on any regularization. We test our methods on the standard TL;DR summarization dataset. We find that one can learn a high-quality policy on our pessimistic reward without using any…
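For readers unfamiliar with the KL-regularized baseline the abstract contrasts against, the following is a minimal toy sketch of that standard objective (expected reward minus a KL penalty toward a reference distribution). It is not the PET method, whose details are not given on this page. The three-response setup, the numbers, and the names `reference`, `rewards`, `beta`, and the helper functions are all hypothetical and chosen only for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

def regularized_optimum(reference, rewards, beta):
    """Closed-form maximizer of the objective above:
    policy(y) is proportional to reference(y) * exp(reward(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: three candidate responses to a single prompt.
reference = [0.6, 0.3, 0.1]   # assumed dataset (reference) distribution
rewards = [1.0, 0.5, 3.0]     # assumed imperfect reward; the rare third
                              # response is scored suspiciously high
beta = 1.0                    # assumed KL penalty strength

best = regularized_optimum(reference, rewards, beta)
greedy = [0.0, 0.0, 1.0]      # policy that fully exploits the reward model

print("regularized optimum:", [round(p, 3) for p in best])
print("objective (optimum):", regularized_objective(best, reference, rewards, beta))
print("objective (greedy): ", regularized_objective(greedy, reference, rewards, beta))
```

In this toy setting the KL penalty pulls the policy back toward the reference distribution, so a response that is rare in the data is discounted even when the reward model scores it highly. That is the trade-off the abstract describes: the penalty limits exploitation of reward errors, but it also rules out policies that sit far from the dataset distribution.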

Taxonomy

Topics: Risk and Safety Analysis · Safety Systems Engineering in Autonomy