Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan

TL;DR
This paper introduces an efficient reward model ensemble approach for RLHF that improves alignment accuracy of language models with human values, using resource-efficient ensemble techniques like linear-layer and LoRA-based methods.
Contribution
We propose a novel reward ensemble method with efficient ensemble techniques to enhance RLHF alignment performance without high computational costs.
Findings
Ensemble methods improve RLHF output alignment.
Linear-layer and LoRA-based ensembles are computationally efficient.
Empirical results show better alignment with ensemble reward models.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. As using an ensemble of large language model-based reward models can be computationally and resource-expensive, we explore efficient ensemble methods including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-n and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.
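As a rough illustration of how a linear-layer ensemble could be combined with Best-of-n selection, the sketch below scores each candidate with several linear heads over shared features and picks the candidate with the highest aggregated reward. This is a minimal sketch and not the authors' implementation: the function names, the toy feature vectors (standing in for a shared language-model backbone's output), and the aggregation rule (mean minus `beta` times the standard deviation across heads) are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def ensemble_rewards(features, heads):
    """Score every candidate with every ensemble member.

    features: (n_candidates, d) array, a stand-in for shared backbone outputs.
    heads:    (k, d) array, one hypothetical linear reward head per member.
    Returns a (k, n_candidates) array of per-member rewards.
    """
    return heads @ features.T

def aggregate(rewards, beta=0.0):
    """Assumed aggregation: ensemble mean minus beta * ensemble std.

    beta = 0 gives the plain mean; beta > 0 penalizes candidates on which
    the ensemble members disagree.
    """
    return rewards.mean(axis=0) - beta * rewards.std(axis=0)

def best_of_n(candidates, features, heads, beta=0.0):
    """Return the candidate with the highest aggregated reward, plus all scores."""
    scores = aggregate(ensemble_rewards(features, heads), beta)
    return candidates[int(np.argmax(scores))], scores

if __name__ == "__main__":
    # Toy data: candidate "a" has a higher mean reward but the heads disagree
    # on it; candidate "b" has a slightly lower mean with full agreement.
    candidates = ["a", "b"]
    features = np.array([[1.0, 0.0], [0.0, 1.0]])
    heads = np.array([[5.0, 2.5], [0.0, 2.5], [4.0, 2.5]])

    print(best_of_n(candidates, features, heads, beta=0.0)[0])  # mean only
    print(best_of_n(candidates, features, heads, beta=1.0)[0])  # disagreement-penalized
```

In this toy setup the plain mean prefers the candidate with the higher average reward, while a positive `beta` shifts the choice toward the candidate on which the heads agree. In a LoRA-based variant, each member would presumably differ by its own adapter weights rather than only by its final linear head.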
Taxonomy
Topics: Reinforcement Learning in Robotics
