Confronting Reward Model Overoptimization with Constrained RLHF
Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer

TL;DR
This paper investigates the problem of reward model overoptimization in composite RLHF systems and proposes a constrained reinforcement learning approach with dynamic weighting to maintain alignment with human preferences.
Contribution
It introduces a novel constrained RLHF method with dynamic weights to prevent overoptimization of composite reward models, enhancing alignment accuracy.
Findings
Overoptimization points are influenced by correlation between component RMs.
Constrained RLHF with dynamic weights maintains RMs within effective thresholds.
Adaptive optimization identifies optimal points during a single run.
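The dynamic weighting described in the findings can be illustrated with a generic Lagrangian-style update. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `update_weights` and `combined_reward`, the learning rate `lr`, and the example thresholds are all hypothetical. Each component RM's weight grows while its reward is below its threshold and shrinks (down to zero) once the reward exceeds it.

```python
from typing import List, Sequence


def update_weights(
    weights: Sequence[float],
    rewards: Sequence[float],
    thresholds: Sequence[float],
    lr: float = 0.1,
) -> List[float]:
    """One projected gradient step on per-RM multipliers (hypothetical helper).

    A weight increases when its RM's reward is below the threshold and
    decreases when the reward is above it; weights are clipped at zero.
    """
    return [
        max(0.0, w + lr * (t - r))
        for w, r, t in zip(weights, rewards, thresholds)
    ]


def combined_reward(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Weighted sum of component RM rewards used as the training signal."""
    return sum(w * r for w, r in zip(weights, rewards))


# Two component RMs: the first is below its threshold, the second above it.
weights = update_weights([1.0, 1.0], rewards=[0.5, 2.0], thresholds=[1.0, 1.0], lr=0.5)
signal = combined_reward(weights, [0.5, 2.0])
```

In this toy setup the first RM gains weight and the second loses it, so the policy is pushed less toward a component that has already passed its threshold.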
Abstract
Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We…
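To make the notion of an overoptimization point concrete, the sketch below takes logged pairs of proxy reward and evaluation score from a single training run and returns the proxy reward at which the evaluation score peaks. The function name `find_overoptimization_point` and the sample numbers are illustrative assumptions, not values or code from the paper.

```python
from typing import Sequence, Tuple


def find_overoptimization_point(
    proxy_rewards: Sequence[float],
    eval_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (proxy reward, evaluation score) at the peak evaluation score.

    Beyond this proxy reward, higher proxy values coincide with lower
    evaluation scores in the logged data.
    """
    if len(proxy_rewards) != len(eval_scores) or not proxy_rewards:
        raise ValueError("inputs must be non-empty and of equal length")
    best = max(range(len(eval_scores)), key=lambda i: eval_scores[i])
    return proxy_rewards[best], eval_scores[best]


# Made-up log: evaluation improves, peaks, then degrades as proxy reward rises.
proxy = [0.1, 0.2, 0.3, 0.4, 0.5]
evals = [0.2, 0.5, 0.7, 0.6, 0.3]
point = find_overoptimization_point(proxy, evals)
```

A real analysis would likely smooth or fit the logged curve first, since a raw argmax is sensitive to noise in individual evaluations.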
Peer Reviews
Decision: ICLR 2024 spotlight
1. The analysis of the over parameterization of composite reward functions is interesting and of importance to LLM-Alignment. 2. The proposed derivative-free optimization method, NM-PPO, is novel and computationally efficient. 3. The paper is generally well written and easy to follow.
**Determining the joint maximizing point seems heuristic** To determine the joint maximizing proxy point, the evaluation scores as a function of the METEOR and intent rewards for each run shown in Fig. 3.1 are plotted and a surface is fit over them. But these evaluations were done by maximizing each reward individually, without accounting for any interaction. Thus, using these to determine the joint maximizing point seems heuristic at best. In other words, if the ultimate objective is to determine
**Originality** The paper presents a unique framework that introduces the concept of "proxy points" to address overoptimization in an environment with multiple proxy reward models and a given ground-truth reward model. This novel idea of defining a threshold for proxy rewards, ensuring that they don't exceed the proxy point, is a commendable original contribution to the field of RLHF. **Quality** The authors have validated the effectiveness of the proposed approach through empirical experime
- One significant limitation, as acknowledged by the authors themselves, is the assumption of the availability of the ground-truth reward model in RLHF. While such an assumption of the gold reward model aids in understanding and analyzing overoptimization as in Gao+ 2022, its actual use within RLHF algorithms seems impractical. - The evaluation metric described in Section A.2 is aimed at respecting both the METEOR and intent reward functions in light of Goodhart's Law. While the authors have me
+ This paper is extremely well-written and contains sufficient related work and explanation to understand the proposed techniques and prior work. + To the best of my knowledge, the proposed technique to avoid LLM overoptimization toward composite RMs is novel. + The analysis is very detailed and showcases the proposed technique works as expected + As LLMs are a very popular topic now and actively being deployed, this approach tackles an important issue in optimizing and improving LLM performanc
- Could you comment on how easy it is to identify the proxy point? In larger or multi-topic datasets, would determining a proxy point be more difficult? - While the paper does contain much information, a lot of important information is in the appendix that would benefit from also appearing briefly in the paper. On that thread, it would be beneficial to highlight important details in D.2. Is there a specific feature of the sample outputs you are attempting to highlight?
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
