The Perfect Blend: Redefining RLHF with Mixture of Judges

Tengyu Xu; Eryk Helenowski; Karthik Abinav Sankararaman; Di Jin; Kaiyan Peng; Eric Han; Shaoliang Nie; Chen Zhu; Hejia Zhang; Wenxuan Zhou; Zhouhao Zeng; Yun He; Karishma Mandyam; Arya Talabzadeh; Madian Khabsa; Gabriel Cohen; Yuandong Tian; Hao Ma; Sinong Wang; Han Fang

arXiv:2409.20370·cs.LG·October 1, 2024·2 cites

Open Access · 3 Datasets · 3 Reviews

TL;DR

This paper introduces CGPO, a novel RLHF method using Mixture of Judges, which effectively mitigates reward hacking and optimizes multiple objectives, significantly improving LLM fine-tuning across diverse tasks.

Contribution

The paper presents Constrained Generative Policy Optimization (CGPO), a new post-training paradigm with Mixture of Judges for principled multi-objective RLHF without extensive tuning.

Findings

01. CGPO outperforms PPO and DPO on various tasks.

02. CGPO reduces reward hacking in coding benchmarks.

03. CGPO achieves up to 12.5% improvement in STEM tasks.

Abstract

Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLMs). However, RLHF has limitations in multi-task learning (MTL) due to the challenges of reward hacking and extreme multi-objective optimization (i.e., trading off multiple and sometimes conflicting objectives). Applying RLHF to MTL currently requires careful tuning of the weights for reward model and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we call Constrained Generative Policy Optimization (CGPO). The core of CGPO is Mixture of Judges (MoJ) with cost-efficient constrained policy optimization with stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require…
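The hand-tuned weighting that the abstract describes can be pictured with a small sketch. This is not CGPO and not code from the paper: the objective names, scores, and weights are all invented for illustration. It only shows that, when reward-model scores are blended by a weighted sum, the response that ends up preferred depends on the chosen weights.

```python
# Illustrative only: linear scalarization of several reward scores,
# i.e. the hand-weighted setup the abstract says requires careful tuning.
# Every name and number below is hypothetical.

def blended_reward(scores, weights):
    """Weighted sum of per-objective reward scores."""
    return sum(weights[name] * scores[name] for name in weights)


def preferred(candidates, weights):
    """Return the name of the candidate with the highest blended reward."""
    return max(candidates, key=lambda c: blended_reward(candidates[c], weights))


# Two candidate responses scored by two hypothetical reward models.
candidates = {
    "response_a": {"helpfulness": 0.9, "safety": 0.2},
    "response_b": {"helpfulness": 0.5, "safety": 0.8},
}

# Two hand-picked weightings of the same objectives.
helpful_heavy = {"helpfulness": 0.8, "safety": 0.2}
safety_heavy = {"helpfulness": 0.3, "safety": 0.7}

print(preferred(candidates, helpful_heavy))  # response_a
print(preferred(candidates, safety_heavy))   # response_b
```

The two weightings rank the same pair of responses in opposite order, which is the sensitivity that makes manual weight selection hard to generalize.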

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The authors proposed a new training paradigm for RLHF which handles multi-objective and multi-constraints, which can address several limitations of existing paradigms. The proposed paradigm is highly modular and has the potential to scale to more complicated future scenarios.
2. The experimental design is thoughtful, spanning a comprehensive set of tasks and constraints. The authors performed extensive evaluations and demonstrated the effectiveness of CGPO.

Weaknesses

1. I appreciate the thoughtful design of the multi-objective and the multi-constraints, as well as the customizability of the framework. But there are a few natural questions following these: a) For all these "customized combinations" (line 292), "tailor the specific reward model to be applied for each task" (line 336), "uniquely tailored for each task" (line 347), "specifically tailored hyperparameter setup" (line 353) -- it's unclear how these design choices are made, and if one can repro

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The manuscript is well-written, and the proposed algorithm is straightforward to understand.
- This work considers the constrained multi-objective alignment setting, which is novel and important. Also, compared to previous works, which usually specify probabilistic constraints, this work specifies a stricter constraint, i.e., $P_{s \sim \mathcal{D}, a\sim\pi_{w}}((s,a) \in \Sigma) \geq 1$.
- According to Table 2, the proposed algorithm(s) outperform the DPO and PPO baselines.
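One way to read the constraint quoted by the reviewer is as a pass rate over sampled prompt-response pairs, which can be estimated empirically. The sketch below is an assumption-laden illustration, not the paper's procedure: the two judges are toy rule-based checks invented here, and the sample set stands in for draws of $s \sim \mathcal{D}$, $a \sim \pi_w$. The constraint as quoted holds only when the estimated rate equals 1.

```python
# Illustrative only: Monte Carlo estimate of the fraction of sampled
# (prompt, response) pairs that every judge accepts, an empirical
# stand-in for P((s, a) in Sigma). The judges are hypothetical toy rules.

def no_refusal_judge(prompt, response):
    """Hypothetical judge: reject responses containing a refusal phrase."""
    return "i cannot help" not in response.lower()


def length_judge(prompt, response, max_words=50):
    """Hypothetical judge: reject responses longer than max_words words."""
    return len(response.split()) <= max_words


def satisfies_all(prompt, response, judges):
    """A pair counts as feasible only if every judge accepts it."""
    return all(judge(prompt, response) for judge in judges)


def satisfaction_rate(samples, judges):
    """Fraction of samples accepted by all judges."""
    if not samples:
        raise ValueError("samples must be non-empty")
    passed = sum(satisfies_all(p, r, judges) for p, r in samples)
    return passed / len(samples)


# Made-up samples standing in for policy generations.
samples = [
    ("What is 2 + 2?", "4"),
    ("Sort a list in Python.", "Use sorted(my_list)."),
    ("Explain RLHF.", "I cannot help with that."),  # fails the refusal judge
    ("Say hi.", "hi " * 60),                        # fails the length judge
]

rate = satisfaction_rate(samples, [no_refusal_judge, length_judge])
print(rate)  # 0.5
```

Here two of the four samples pass both judges, so the estimated rate is 0.5 and the quoted constraint would be violated on this toy set.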

Weaknesses

- While the work studies a constrained optimization problem, the constraint satisfaction in Equation (3) is not verified.
- In line 109, this work claims that the proposed method could avoid compromises due to conflicting goals from other tasks; however, in Algorithm 2, in the parameter updating step (step 6), conflicting goals might induce conflicting gradients ($\tilde g_l(\pi_{w_{t}})$), which still leads to compromises.
- Since the MoJ is involved in the training process, the cost of LLM ca

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The authors use a variety of tasks to demonstrate their method. They also implemented several variants of the CGPO framework. The idea is simple, and the results look promising.

Weaknesses

Some parts of the paper are not easy to follow. In the paragraph beginning at line 77, I would recommend that the authors use Figure 1 to explain their technical contribution rather than the illustrative MoJ, which is very clear, whereas the primal-dual constrained optimization part could be further explained with a "figure 1". Assuming prior knowledge of the reward models and of how they could be hacked is too strong and lacks grounding. The paper claims on improving Pareto frontier at the intro section but

Taxonomy

Topics: Judicial and Constitutional Studies · Legal Education and Practice Innovations · Legal Systems and Judicial Processes

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization