MaxMin-RLHF: Alignment with Diverse Human Preferences

Souradip Chakraborty; Jiahao Qiu; Hui Yuan; Alec Koppel; Furong Huang; Dinesh Manocha; Amrit Singh Bedi; Mengdi Wang

arXiv:2402.08925 · cs.CL · December 30, 2024 · 2 cites

Open Access

TL;DR

This paper introduces MaxMin-RLHF, an alignment method that learns a mixture of preference distributions and optimizes a MaxMin objective over them, so that the resulting language model represents diverse user groups rather than a single aggregated preference.

Contribution

The paper proposes a MaxMin alignment objective, inspired by the Egalitarian principle in social choice theory, to address the inability of a single reward model to represent diverse preferences.

Findings

01. Achieves over 16% improvement in win-rates compared to traditional RLHF.

02. Enhances minority group accuracy by over 33% without affecting majority groups.

03. Demonstrates robustness and fairness across small and large-scale language models.

Abstract

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby…
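
To make the contrast between a single aggregated reward and the Egalitarian MaxMin idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the candidate policies, the per-group reward values, and the group weights are all hypothetical numbers chosen for illustration, and the names `average_choice` and `maxmin_choice` are made up for this example. It only shows how selecting by population-weighted mean reward can favor the majority group, while selecting by the worst-off group's reward does not.

```python
from typing import Dict, List

# Hypothetical toy data: expected reward of each candidate policy under each
# user group's reward model, listed as [majority, minority].
# These values are invented for illustration only.
GROUP_REWARDS: Dict[str, List[float]] = {
    "policy_a": [0.9, 0.1],  # strong for the majority, poor for the minority
    "policy_b": [0.6, 0.5],  # balanced across both groups
    "policy_c": [0.7, 0.3],
}

# Assumed population share of each group (majority, minority).
GROUP_WEIGHTS: List[float] = [0.8, 0.2]


def average_choice(rewards: Dict[str, List[float]], weights: List[float]) -> str:
    """Pick the policy with the highest population-weighted mean reward."""
    return max(
        rewards,
        key=lambda p: sum(w * r for w, r in zip(weights, rewards[p])),
    )


def maxmin_choice(rewards: Dict[str, List[float]]) -> str:
    """Pick the policy whose worst-off group receives the highest reward."""
    return max(rewards, key=lambda p: min(rewards[p]))


if __name__ == "__main__":
    print("weighted-average choice:", average_choice(GROUP_REWARDS, GROUP_WEIGHTS))
    print("max-min choice:", maxmin_choice(GROUP_REWARDS))
```

With these toy numbers, the weighted-average rule selects `policy_a`, which leaves the minority group with a reward of 0.1, whereas the max-min rule selects `policy_b`, whose lowest group reward is 0.5. In the paper the maximization is over a learned policy and the groups come from the learned mixture, not from a fixed table like this one.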

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis