Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble

Shun Zhang; Zhenfang Chen; Sunli Chen; Yikang Shen; Zhiqing Sun; Chuang Gan

arXiv:2401.16635·cs.LG·October 23, 2024·2 cites

TL;DR

This paper introduces an efficient reward model ensemble approach for RLHF that improves how well language model outputs align with human values, using resource-efficient ensemble techniques such as linear-layer and LoRA-based ensembles.

Contribution

We propose a novel reward ensemble method with efficient ensemble techniques to enhance RLHF alignment performance without high computational costs.

Findings

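01 Ensembling reward models improves the alignment of RLHF outputs.

02 Linear-layer and LoRA-based ensembles are computationally efficient alternatives to ensembling full reward models.

03 In Best-of-$n$ and Proximal Policy Optimization experiments, ensembled reward models yield better-aligned outputs.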
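As a rough illustration of why a linear-layer ensemble is cheap, the sketch below assumes (this page does not spell out the architecture) that all ensemble members share one backbone's features and differ only in their final linear head. The names `make_linear_heads` and `ensemble_rewards`, the dimensions, and the random features standing in for backbone outputs are all hypothetical.

```python
import numpy as np

def make_linear_heads(hidden_dim, num_heads, seed=0):
    """Create independent linear reward heads (assumed design, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # One weight vector per ensemble member; each head gets a different random init.
    weights = rng.normal(scale=hidden_dim ** -0.5, size=(num_heads, hidden_dim))
    biases = np.zeros(num_heads)
    return weights, biases

def ensemble_rewards(features, weights, biases):
    """Map shared features (batch, hidden_dim) to per-head rewards (batch, num_heads)."""
    return features @ weights.T + biases

# Random features stand in for the shared backbone's output on 4 responses.
features = np.random.default_rng(1).normal(size=(4, 16))
weights, biases = make_linear_heads(hidden_dim=16, num_heads=3)
rewards = ensemble_rewards(features, weights, biases)
print(rewards.shape)  # (4, 3): one reward per response per ensemble member
```

Under this assumption the backbone runs once per response, and each extra ensemble member costs only one additional vector of head parameters.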

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models with human values. However, RLHF relies on a reward model that is trained with a limited amount of human preference data, which could lead to inaccurate predictions. As a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that allows the reward model to make more accurate predictions. As using an ensemble of large language model-based reward models can be computationally expensive and resource-intensive, we explore efficient ensemble methods including linear-layer ensemble and LoRA-based ensemble. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.
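To make the Best-of-$n$ setting concrete, here is a minimal sketch of selecting one of $n$ candidate responses using an ensemble of reward predictions. The abstract does not say how member predictions are combined, so the two aggregation rules shown (plain mean, and mean minus a multiple of the standard deviation) are assumptions for illustration, as are the names `aggregate`, `best_of_n`, and `beta` and the toy reward values.

```python
import numpy as np

def aggregate(member_rewards, mode="mean", beta=1.0):
    """Combine per-member rewards of shape (n_candidates, n_members) into one score each."""
    member_rewards = np.asarray(member_rewards, dtype=float)
    mean = member_rewards.mean(axis=-1)
    if mode == "mean":
        return mean
    if mode == "lcb":
        # Assumed conservative rule: penalize candidates the members disagree on.
        return mean - beta * member_rewards.std(axis=-1)
    raise ValueError(f"unknown mode: {mode}")

def best_of_n(candidates, member_rewards, mode="mean", beta=1.0):
    """Return the candidate with the highest aggregated ensemble score."""
    scores = aggregate(member_rewards, mode=mode, beta=beta)
    return candidates[int(np.argmax(scores))]

# Toy data: 3 candidate responses scored by 3 ensemble members.
candidates = ["response A", "response B", "response C"]
member_rewards = [
    [1.0, 1.0, 1.0],  # members agree
    [3.0, 0.0, 0.6],  # higher mean, but members disagree strongly
    [0.5, 0.5, 0.5],
]

pick_mean = best_of_n(candidates, member_rewards, mode="mean")
pick_lcb = best_of_n(candidates, member_rewards, mode="lcb")
```

With these toy values the mean rule selects "response B", while the conservative rule prefers "response A" because the members agree on it. This is the kind of behavior an ensemble can offer when a single reward model's prediction is unreliable.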

Taxonomy

Topics: Reinforcement Learning in Robotics