MaxMin-RLHF: Alignment with Diverse Human Preferences
Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, and Mengdi Wang

TL;DR
This paper introduces MaxMin-RLHF, a novel reinforcement learning approach that better captures diverse human preferences by learning a mixture of preference distributions, improving fairness and robustness in language model alignment.
Contribution
It proposes a MaxMin alignment objective based on social choice theory, addressing the limitations of single reward models in representing diverse preferences.
Findings
Achieves over 16% improvement in win-rates compared to traditional RLHF.
Improves win-rate (accuracy) for minority groups by over 33% without compromising majority-group performance.
Demonstrates robustness and fairness across small and large-scale language models.
Abstract
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. To provide an equitable solution to the problem, we learn a mixture of preference distributions via an expectation-maximization algorithm and propose a MaxMin alignment objective for policy learning inspired by the Egalitarian principle in social choice theory to better represent diverse human preferences. We elucidate the connection of our proposed approach to distributionally robust optimization and general utility RL, thereby…
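To make the contrast between a single aggregated reward and the Egalitarian MaxMin idea concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes the per-group reward models have already been learned, replaces policy optimization with selection over a small, finite set of candidate policies, and uses made-up reward values. The function names `maxmin_select` and `weighted_select` are hypothetical.

```python
# Illustrative sketch only. rewards[i][u] is the (hypothetical) expected reward
# of candidate policy i under the reward model of user group u.

def maxmin_select(rewards):
    """Return the index of the policy whose worst-off group does best."""
    worst_case = [min(row) for row in rewards]
    return max(range(len(rewards)), key=lambda i: worst_case[i])


def weighted_select(rewards, weights):
    """Return the index of the policy with the best population-weighted reward.

    This stands in for a single reward model fit to pooled preference data,
    where larger groups dominate the aggregate.
    """
    scores = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return max(range(len(rewards)), key=lambda i: scores[i])


# Made-up numbers: columns are (majority group, minority group).
rewards = [
    [0.9, 0.1],  # policy 0: strong for the majority, poor for the minority
    [0.6, 0.5],  # policy 1: balanced across both groups
    [0.2, 0.7],  # policy 2: favors the minority
]
group_weights = [0.8, 0.2]  # assumed population shares

single_reward_choice = weighted_select(rewards, group_weights)
maxmin_choice = maxmin_select(rewards)
```

With these assumed numbers, the weighted aggregate favors the policy that serves the majority well, while the max-min rule picks the balanced policy because its lowest group reward is the highest among the candidates.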
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
