Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Johannes Ackermann; Takashi Ishida; Masashi Sugiyama

arXiv:2507.15507 · cs.LG · July 22, 2025

TL;DR

This paper introduces Off-Policy Corrected Reward Modeling (OCRM), a method that corrects the reward model in RLHF for the distribution shift that arises as the policy is trained, so that the resulting language model aligns better with human preferences.

Contribution

The paper proposes OCRM, an iterative importance weighting technique that corrects reward models without additional labels, enhancing policy performance in RLHF tasks.
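The summary describes the importance weighting only at a high level, so the following is a rough sketch of the general idea rather than the authors' implementation. It assumes a pairwise (Bradley-Terry style) preference loss and sequence-level weights given by the ratio of the current policy's probability to that of the policy that generated the labeled responses. The function names, the `clip` threshold, and the self-normalization step are hypothetical choices made for illustration.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def importance_weights(logp_current, logp_behavior, clip=10.0):
    """Hypothetical helper: one weight per labeled response pair.

    Each argument is a list of (logp_response_a, logp_response_b) tuples
    holding summed token log-probabilities under the current policy and
    under the policy that originally sampled the pair.
    """
    log_cap = math.log(clip)
    raw = []
    for cur, beh in zip(logp_current, logp_behavior):
        log_w = sum(cur) - sum(beh)
        # Assumption: cap large ratios to limit variance.
        raw.append(math.exp(min(log_w, log_cap)))
    mean = sum(raw) / len(raw)
    # Assumption: self-normalize so the weights average to 1.
    return [w / mean for w in raw]


def weighted_preference_loss(r_chosen, r_rejected, weights):
    """Importance-weighted pairwise loss: weighted mean of
    -log sigmoid(r_chosen - r_rejected) over the labeled pairs."""
    total = sum(
        w * -log_sigmoid(rc - rr)
        for w, rc, rr in zip(weights, r_chosen, r_rejected)
    )
    return total / sum(weights)
```

With all weights equal to 1 this reduces to the usual unweighted preference loss. In an iterative scheme of the kind the summary describes, one would presumably recompute the weights as the policy changes and refit the reward model on the same labeled pairs, which is why no additional labels are needed.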

Findings

01. OCRM outperforms standard RLHF in summarization tasks.

02. OCRM significantly improves chatbot response quality.

03. The method effectively mitigates overoptimization issues.

Abstract

Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent…
