Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
Johannes Ackermann, Takashi Ishida, Masashi Sugiyama

TL;DR
This paper introduces Off-Policy Corrected Reward Modeling (OCRM), a method to improve reward models in RLHF by addressing distribution shift, leading to better language model alignment with human preferences.
Contribution
The paper proposes OCRM, an iterative importance-weighting technique that corrects the reward model for distribution shift without additional labels, improving the performance of the final policy in RLHF tasks.
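To make the idea of importance weighting concrete, here is a minimal sketch of an importance-weighted pairwise reward-model loss. It is an illustration under stated assumptions, not the authors' implementation: the Bradley-Terry form of the loss, the use of a sequence-level probability ratio over both responses as the weight, the clipping threshold, and the self-normalization are all assumptions, and the names `importance_weights` and `weighted_bt_loss` are hypothetical.

```python
import numpy as np


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)


def importance_weights(logp_new, logp_old, clip=10.0):
    """Self-normalized importance weights for preference pairs.

    logp_new, logp_old: arrays of shape (n_pairs, 2) holding the log-probability
    of each response in a pair under the current policy and under the policy
    that generated the labeled data. (Assumed weighting scheme, for illustration.)
    """
    log_w = (logp_new - logp_old).sum(axis=1)    # log ratio for the whole pair
    w = np.exp(np.minimum(log_w, np.log(clip)))  # clip large ratios to limit variance
    return w / w.mean()                          # normalize so the mean weight is 1


def weighted_bt_loss(r_chosen, r_rejected, weights):
    # Importance-weighted negative log-likelihood of the observed preferences.
    return -np.mean(weights * log_sigmoid(r_chosen - r_rejected))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_old = rng.normal(-20.0, 1.0, size=(4, 2))
    logp_new = logp_old + rng.normal(0.0, 0.5, size=(4, 2))
    w = importance_weights(logp_new, logp_old)
    loss = weighted_bt_loss(rng.normal(size=4), rng.normal(size=4), w)
    print(w, loss)
```

Pairs that the current policy is more likely to produce than the data-collecting policy receive larger weights, so the reward model is fit more closely in the region the policy now visits. No new preference labels are needed, only log-probabilities of the already-labeled responses.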
Findings
OCRM outperforms standard RLHF in summarization tasks.
OCRM significantly improves chatbot response quality.
The method effectively mitigates overoptimization issues.
Abstract
Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent…
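For orientation, the reward-model training step described above is commonly written as a Bradley-Terry maximum-likelihood objective. The notation below ($r_\phi$ for the RM, $\pi_{\mathrm{SFT}}$ for the supervised fine-tuned policy, $y_w$ and $y_l$ for the preferred and rejected responses) is assumed for illustration and is not taken from the paper:

```latex
\mathcal{L}(\phi)
  = -\,\mathbb{E}_{x \sim \mathcal{D},\; (y_w, y_l) \sim \pi_{\mathrm{SFT}}(\cdot \mid x)}
    \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```

The expectation is taken over responses sampled from $\pi_{\mathrm{SFT}}$, whereas during RL the reward model is queried on responses from the current policy. That mismatch is the distribution shift the abstract refers to.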
