Reward Model Ensembles Help Mitigate Overoptimization
Thomas Coste, Usman Anwar, Robert Kirk, David Krueger

TL;DR
This paper demonstrates that ensemble-based conservative optimization methods, such as worst-case and uncertainty-weighted optimization, effectively mitigate reward model overoptimization in reinforcement learning from human feedback, improving model performance and robustness.
Contribution
The study systematically evaluates ensemble-based conservative optimization techniques for overoptimization mitigation in RLHF, extending prior work with noisy labels and multiple optimization methods.
Findings
Conservative optimization eliminates overoptimization in BoN sampling.
Ensemble methods outperform single reward models in PPO.
Combining conservative optimization with KL penalty prevents overoptimization without performance loss.
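The best-of-n (BoN) finding above can be illustrated with a minimal sketch, assuming each candidate response has already been scored by every reward model in the ensemble. The function names (`worst_case`, `best_of_n`) and the toy scores are hypothetical and are not taken from the paper's code. The sketch only shows how a conservative aggregate changes which candidate BoN selects.

```python
from statistics import mean
from typing import Callable, Sequence

# An aggregator maps one candidate's ensemble scores to a single scalar.
Aggregator = Callable[[Sequence[float]], float]


def worst_case(scores: Sequence[float]) -> float:
    """Conservative aggregate: the lowest score any ensemble member assigns."""
    return min(scores)


def best_of_n(
    candidates: Sequence[str],
    ensemble_scores: Sequence[Sequence[float]],
    aggregate: Aggregator = worst_case,
) -> str:
    """Return the candidate with the highest aggregated ensemble score.

    ensemble_scores[i][j] is the score that reward model j gives candidate i.
    """
    if len(candidates) != len(ensemble_scores):
        raise ValueError("need one row of ensemble scores per candidate")
    best_index = max(
        range(len(candidates)),
        key=lambda i: aggregate(ensemble_scores[i]),
    )
    return candidates[best_index]


# Toy data (made up): three candidates scored by a three-member ensemble.
candidates = ["a", "b", "c"]
scores = [
    [0.9, 0.1, 0.8],    # high mean, but one member strongly disagrees
    [0.6, 0.5, 0.55],   # slightly lower mean, members agree
    [0.2, 0.3, 0.1],
]

picked_by_mean = best_of_n(candidates, scores, aggregate=mean)
picked_by_worst_case = best_of_n(candidates, scores)
```

With these toy scores, averaging selects candidate `a`, whose high mean hides one dissenting model, while the worst-case aggregate selects `b`, on which all members agree.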
Abstract
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for…
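One common way to write the ensemble objectives named in the abstract is sketched below. The notation is introduced here for illustration and is an assumption, not quoted from the text above: $R_1, \dots, R_k$ are the $k$ ensemble reward models, $(q, a)$ is a prompt and response pair, and $\lambda \ge 0$ is a hypothetical coefficient controlling how strongly disagreement is penalized. UWO is written with the intra-ensemble variance as the uncertainty term, which is one plausible choice.

```latex
\begin{align}
R_{\mathrm{mean}}(q, a) &= \frac{1}{k} \sum_{i=1}^{k} R_i(q, a) \\
R_{\mathrm{WCO}}(q, a)  &= \min_{i \in \{1, \dots, k\}} R_i(q, a) \\
R_{\mathrm{UWO}}(q, a)  &= R_{\mathrm{mean}}(q, a)
  - \lambda \cdot \frac{1}{k} \sum_{i=1}^{k}
    \bigl( R_i(q, a) - R_{\mathrm{mean}}(q, a) \bigr)^2
\end{align}
```

Under this formulation, both conservative objectives reduce to the plain ensemble mean when all members agree, and fall below it as disagreement grows.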
Peer Reviews
Decision: ICLR 2024 poster
1. The experimental results look promising. 2. The overoptimization problem is a novel and important problem to the LLM community.
1. [Major] There are some writing and presentation issues in the manuscript. While this manuscript extensively refers to [1] in the writing, the reviewer would recommend the authors update the paper so that the current submission does not require the readers to read [1] to understand the submission thoroughly. See detailed comments in [Questions]. 2. [Minor] For paragraph Supervised Fine-tuning in Section 4.3, the hyperlink `(see Section 4.1 for details)` seems to be broken. In addition, the rev
1. This paper provides extensive empirical evidence suggesting that ensemble-based methods can improve the robustness of RMs and reduce overoptimization, which makes the claims of the paper well-supported. 2. The paper studies an important problem in RLHF, i.e. reward overoptimization, and presents various methods that clearly mitigate this challenge. I think the paper is of value to the RLHF community.
1. While the empirical results are quite comprehensive in the paper, the model size seems a bit small, with the biggest RM being 1.3B. Given Figure 8, it seems that the gain of the ensemble-based methods diminishes as the model size increases. It would be important to investigate whether ensemble-based methods have little gain with even bigger models, which are more commonly used by users. 2. From Figure 9, it seems that with a bigger dataset size (46K), ensemble-based methods are not that much better tha
The paper addresses a very important problem—one of the main bottlenecks to improving RLHF training and safety of LLM-based chatbots is making reward models more robust. The solution technique is not particularly novel, as pessimistic optimization using an ensemble has been widely used in model-based RL, offline RL, preference learning, etc. However, I don't know of prior work that has specifically evaluated this technique for RLHF on LLM-based chatbots. Thus, I view the primary contribution of
In terms of the experiments, one weakness is that the PPO experiments seem to be mostly done with a single random seed, while due to the high noise in RL training it is best to use a few random seeds (see https://arxiv.org/abs/2108.13264, https://arxiv.org/abs/2304.01315). Another weakness of the results is that it's hard to know how to interpret the gold reward. How much better is an LLM with an average gold reward increase of 0.5 vs. 0.4? AlpacaFarm and others use a win-rate which is more int
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling
Methods: Entropy Regularization · Proximal Policy Optimization
