Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang

TL;DR
This paper proposes a regularization method for reward models used to align LLMs. By preserving the text-generation capabilities of the reward model's hidden states, it improves generalization to unseen prompts and responses and reduces reward over-optimization.
Contribution
It introduces a regularization technique that maintains language model capabilities during reward model training, enhancing out-of-distribution performance and robustness.
Findings
Improved reward model accuracy on OOD tasks
Reduced reward over-optimization in RLHF
Enhanced robustness of preference learning
Abstract
Reward models trained on human preference data have been proven to effectively align Large Language Models (LLMs) with human intent within the framework of reinforcement learning from human feedback (RLHF). However, current reward models have limited generalization capabilities to unseen prompts and responses, which can lead to an unexpected phenomenon known as reward over-optimization, resulting in a decline in actual performance due to excessive optimization of rewards. While previous research has advocated for constraining policy optimization, our study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts by regularizing the hidden states. Specifically, we retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities, while concurrently…
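The training objective described in the abstract can be sketched as a pairwise preference loss on the reward head plus a text-generation loss on the retained language model head. The following is a minimal, standard-library-only illustration, not the authors' implementation. It assumes a Bradley-Terry preference loss, a token-level cross-entropy on the chosen response as the text-generation term, and a convex combination with a hypothetical weight `alpha`. The function names and the default `alpha` value are illustrative choices, since the excerpt above does not specify them.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def token_nll(logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    logits: one list of vocabulary scores per position (from the LM head).
    target_ids: the token id expected at each position.
    """
    total = 0.0
    for row, target in zip(logits, target_ids):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total / len(target_ids)


def regularized_reward_loss(r_chosen, r_rejected, lm_logits, chosen_ids, alpha=0.01):
    """Reward loss plus a text-generation regularizer on the shared hidden states.

    alpha is a hypothetical weighting coefficient chosen for illustration.
    """
    reward_term = bradley_terry_loss(r_chosen, r_rejected)
    text_term = token_nll(lm_logits, chosen_ids)
    return (1.0 - alpha) * reward_term + alpha * text_term


# Toy example: scalar rewards for one preference pair, plus LM-head logits
# over a 4-token vocabulary for a 2-token chosen response.
loss = regularized_reward_loss(
    r_chosen=1.2,
    r_rejected=0.3,
    lm_logits=[[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]],
    chosen_ids=[0, 2],
)
```

In a real model both terms would be computed from the same hidden states, so the gradient of the text-generation term constrains how far the reward head can pull the representation away from the base model.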
Code & Models
- 🤗 Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback · model · 96 dl · ♡ 11
- 🤗 Ray2333/GRM-llama3-8B-sftreg · model · 18 dl · ♡ 5
- 🤗 Ray2333/GRM-llama3-8B-distill · model · 393 dl · ♡ 6
- 🤗 Ray2333/GRM-Gemma-2B-sftreg · model · 827 dl · ♡ 3
- 🤗 Ray2333/GRM-Gemma-2B-rewardmodel-ft · model · 643 dl · ♡ 1
- 🤗 Ray2333/GRM-Llama3-8B-rewardmodel-ft · model · 97 dl · ♡ 1
- 🤗 Ray2333/GRM-Gemma2-2B-sftreg · model · 7 dl · ♡ 2
- 🤗 Ray2333/GRM-llama3.2-3B-sftreg · model · 5 dl · ♡ 1
- 🤗 Ray2333/GRM-Llama3.2-3B-rewardmodel-ft · model · 1.5k dl · ♡ 13
- 🤗 Ray2333/GRM-gemma2-2B-rewardmodel-ft · model · 116 dl · ♡ 7
Taxonomy
Topics: Imbalanced Data Classification Techniques
Methods: ALIGN · Balanced Selection
