Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs

Rui Yang; Ruomeng Ding; Yong Lin; Huan Zhang; Tong Zhang

arXiv:2406.10216·cs.CL·October 24, 2024

TL;DR

This paper proposes a regularization method for LLM reward models that improves generalization to unseen prompts and responses and reduces reward over-optimization by preserving the text-generation capability of the hidden states.

Contribution

The paper introduces a regularization technique that retains the base model's language model head and adds text-generation losses during reward model training, improving out-of-distribution (OOD) performance and robustness.

Findings

01 Improved reward model accuracy on OOD tasks

02 Reduced reward over-optimization in RLHF

03 Enhanced robustness of preference learning

Abstract

Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models have limited generalization capabilities to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, resulting in a decline in actual performance due to excessive optimization of rewards. While previous research has advocated for constraining policy optimization, our study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently…
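The idea in the abstract, a reward objective trained alongside a text-generation loss computed through the retained language model head, can be illustrated with a small numeric sketch. Everything specific here is an assumption made for illustration: the Bradley-Terry style pairwise reward loss, the choice of next-token negative log-likelihood on the chosen response as the text-generation loss, the convex weighting, and the names `pairwise_reward_loss`, `next_token_nll`, `regularized_loss`, and `alpha`. It is not the authors' implementation.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(r_chosen, r_rejected):
    # Assumed Bradley-Terry style preference loss on scalar rewards:
    # small when the chosen response scores above the rejected one.
    return -log_sigmoid(r_chosen - r_rejected)


def next_token_nll(logits_per_step, target_ids):
    # Assumed text-generation loss: mean negative log-likelihood of the
    # target tokens under the logits produced by the retained LM head.
    total = 0.0
    for logits, target in zip(logits_per_step, target_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def regularized_loss(r_chosen, r_rejected, lm_logits, target_ids, alpha=0.5):
    # Hypothetical weighting: alpha trades off the reward objective
    # against the text-generation regularizer.
    reward_term = pairwise_reward_loss(r_chosen, r_rejected)
    text_term = next_token_nll(lm_logits, target_ids)
    return (1.0 - alpha) * reward_term + alpha * text_term


if __name__ == "__main__":
    # Toy values: two decoding steps over a three-token vocabulary.
    lm_logits = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]]
    targets = [0, 1]
    print(regularized_loss(1.2, -0.4, lm_logits, targets, alpha=0.1))
```

In a real model both terms would be computed from the same hidden states, so gradients from the text-generation term constrain the representation that the reward head reads. Setting `alpha=0` in this sketch recovers plain pairwise reward training.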

Taxonomy

Topics: Imbalanced Data Classification Techniques

Methods: ALIGN · Balanced Selection