Confronting Reward Model Overoptimization with Constrained RLHF

Ted Moskovitz; Aaditya K. Singh; DJ Strouse; Tuomas Sandholm; Ruslan Salakhutdinov; Anca D. Dragan; Stephen McAleer

arXiv:2310.04373·cs.LG·October 11, 2023·1 cites

TL;DR

This paper investigates the problem of reward model overoptimization in composite RLHF systems and proposes a constrained reinforcement learning approach with dynamic weighting to maintain alignment with human preferences.

Contribution

It introduces a novel constrained RLHF method with dynamic weights to prevent overoptimization of composite reward models, enhancing alignment accuracy.

Findings

01. Overoptimization points are influenced by correlation between component RMs.

02. Constrained RLHF with dynamic weights maintains RMs within effective thresholds.

03. Adaptive optimization identifies optimal points during a single run.
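The "dynamic weights" in finding 02 can be pictured as Lagrange multipliers that grow when a component reward crosses its threshold. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes constraints of the form r_i <= theta_i, a plain projected dual-ascent update, and hypothetical names (`dual_ascent_step`, `lagrangian_reward`, `step_size`) chosen only for illustration.

```python
from typing import List


def dual_ascent_step(multipliers: List[float],
                     rewards: List[float],
                     thresholds: List[float],
                     step_size: float = 0.1) -> List[float]:
    """One projected dual-ascent step for the assumed constraints r_i <= theta_i.

    A multiplier grows when its reward exceeds the threshold and shrinks
    otherwise; the max(0, .) projection keeps it non-negative.
    """
    return [
        max(0.0, lam + step_size * (r - theta))
        for lam, r, theta in zip(multipliers, rewards, thresholds)
    ]


def lagrangian_reward(rewards: List[float],
                      multipliers: List[float],
                      thresholds: List[float]) -> float:
    """Sum of component rewards minus multiplier-weighted constraint violations.

    The effective weight on component i is (1 - lam_i), so a component
    that overshoots its threshold is progressively down-weighted.
    """
    return sum(
        r - lam * (r - theta)
        for r, lam, theta in zip(rewards, multipliers, thresholds)
    )


if __name__ == "__main__":
    # Hypothetical values: the first component is above its threshold,
    # the second is below it.
    rewards = [0.8, 0.2]
    thresholds = [0.5, 0.5]
    multipliers = [0.0, 0.0]
    for _ in range(3):
        multipliers = dual_ascent_step(multipliers, rewards, thresholds)
        print(multipliers, lagrangian_reward(rewards, multipliers, thresholds))
```

In an actual RLHF loop the multiplier update would alternate with policy updates on the shaped reward; here the rewards are held fixed only to show the direction in which the weights move.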

Abstract

Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We…
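To make the notion of an overoptimization point concrete, the sketch below takes a sweep of (proxy reward, evaluation score) pairs and returns the proxy reward at which the evaluation score peaks. It is an illustration only: the function names and the sample numbers are hypothetical, and picking the argmax of a recorded sweep is a simplification rather than the procedure used in the paper.

```python
from typing import List, Tuple


def find_proxy_point(sweep: List[Tuple[float, float]]) -> float:
    """Return the proxy reward at which the evaluation score is highest.

    `sweep` holds (proxy_reward, evaluation_score) pairs recorded over
    training; past the returned value, more proxy reward no longer helps.
    """
    if not sweep:
        raise ValueError("sweep must contain at least one (proxy, eval) pair")
    best_proxy, _ = max(sweep, key=lambda pair: pair[1])
    return best_proxy


def is_overoptimized(proxy_reward: float, proxy_point: float) -> bool:
    """True once the proxy reward has moved past the located peak."""
    return proxy_reward > proxy_point


if __name__ == "__main__":
    # Hypothetical sweep: the evaluation score rises, peaks, then falls
    # while the proxy reward keeps increasing.
    sweep = [(0.1, 0.2), (0.3, 0.5), (0.5, 0.7), (0.7, 0.6), (0.9, 0.4)]
    point = find_proxy_point(sweep)
    print(point, is_overoptimized(0.7, point))
```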

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The analysis of the overoptimization of composite reward functions is interesting and of importance to LLM alignment.
2. The proposed derivative-free optimization method, NM-PPO, is novel and computationally efficient.
3. The paper is generally well written and easy to follow.

Weaknesses

**Determining the joint maximizing point seems heuristic.** To determine the joint maximizing proxy point, the evaluation scores from each run shown in Fig. 3.1 are plotted as a function of the METEOR and intent rewards, and a surface is fit over them. But these evaluations were done by maximizing each reward individually, without accounting for any interaction between them. Using them to determine the joint maximizing point therefore seems heuristic at best. In other words, if the ultimate objective is to determine

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

**Originality** The paper presents a unique framework that introduces the concept of "proxy points" to address overoptimization in an environment with multiple proxy reward models and a given ground-truth reward model. This novel idea of defining a threshold for proxy rewards, ensuring that they don't exceed the proxy point, is a commendable original contribution to the field of RLHF.

**Quality** The authors have validated the effectiveness of the proposed approach through empirical experime

Weaknesses

- One significant limitation, as acknowledged by the authors themselves, is the assumption that a ground-truth reward model is available in RLHF. While assuming a gold reward model aids in understanding and analyzing overoptimization, as in Gao et al. (2022), its actual use within RLHF algorithms seems impractical.
- The evaluation metric described in Section A.2 is aimed at respecting both the METEOR and intent reward functions in light of Goodhart's Law. While the authors have me

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

+ This paper is extremely well-written and contains sufficient related work and explanation to understand the proposed techniques and prior work.
+ To the best of my knowledge, the proposed technique to avoid LLM overoptimization toward composite RMs is novel.
+ The analysis is very detailed and showcases the proposed technique works as expected.
+ As LLMs are a very popular topic now and actively being deployed, this approach tackles an important issue in optimizing and improving LLM performanc

Weaknesses

- Could you comment on how easy it is to identify the proxy point? In larger or multi-topic datasets, would determining a proxy point be more difficult?
- While the paper does contain much information, a lot of important information is in the appendix that would benefit from also appearing briefly in the paper. On that thread, it would be beneficial to highlight important details in D.2. Is there a specific feature of the sample outputs you are attempting to highlight?

Code & Models

Repositories

tedmoskovitz/constrainedrl4lms (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications