Nash Learning from Human Feedback

Rémi Munos; Michal Valko; Daniele Calandriello; Mohammad Gheshlaghi Azar; Mark Rowland; Zhaohan Daniel Guo; Yunhao Tang; Matthieu Geist; Thomas Mesnard; Andrea Michi; Marco Selvi; Sertan Girgin; Nikola Momchev; Olivier Bachem; Daniel J. Mankowitz; Doina Precup; Bilal Piot

arXiv:2312.00886 · stat.ML · June 12, 2024 · 1 citation

Open Access 2 Models 1 Datasets

TL;DR

This paper proposes Nash learning from human feedback (NLHF), which learns a pairwise preference model from human feedback and fine-tunes the LLM policy toward the Nash equilibrium of that model, as an alternative to reward-model-based RLHF.

Contribution

It introduces a novel Nash equilibrium-based framework for preference modeling and policy optimization, along with algorithms for both tabular and deep-learning policy representations.

Findings

01. The Nash-MD algorithm converges to the Nash equilibrium in tabular settings.

02. The deep-learning versions of the algorithms are used to fine-tune LLMs on a text summarization task.

03. NLHF outperforms traditional reward model-based methods in the paper's experiments.
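
The tabular convergence finding can be illustrated with a small sketch. The code below is not taken from the paper: it is a minimal mirror-descent-style update in the spirit of Nash-MD, assuming the update is an exponentiated step taken from a geometric mixture of the current policy and a reference policy. The names `nash_md_step`, `run_nash_md`, `PREF`, `MU`, and the parameters `eta` (step size) and `tau` (regularization strength) are hypothetical, and the 3-action preference matrix is made up for illustration.

```python
import math

def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def nash_md_step(pi, mu, pref, eta, tau):
    """One mirror-descent-style update (assumed form, not the paper's code)."""
    n = len(pi)
    # Geometric mixture between the current policy pi and the reference policy mu.
    mix = normalize([pi[i] ** (1 - eta * tau) * mu[i] ** (eta * tau) for i in range(n)])
    # Probability that action i is preferred over an action sampled from the mixture.
    win = [sum(pref[i][j] * mix[j] for j in range(n)) for i in range(n)]
    # Exponentiated update, taken from the mixture rather than from pi itself.
    return normalize([mix[i] * math.exp(eta * win[i]) for i in range(n)])

def run_nash_md(pref, mu, pi0, eta=0.1, tau=0.5, steps=2000):
    """Iterate the update from the starting policy pi0 and return the last iterate."""
    pi = list(pi0)
    for _ in range(steps):
        pi = nash_md_step(pi, mu, pref, eta, tau)
    return pi

# Hypothetical cyclic preferences: 0 beats 1, 1 beats 2, 2 beats 0 (each with prob 0.9).
# pref[i][j] is the probability that action i is preferred over action j.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
MU = [1 / 3, 1 / 3, 1 / 3]  # uniform reference policy (an assumption)

if __name__ == "__main__":
    print(run_nash_md(PREF, MU, pi0=[0.7, 0.2, 0.1]))
```

With a uniform reference policy and this symmetric cycle, the uniform policy is a fixed point of the update by symmetry, so the last iterate is expected to approach it. No single action is preferred over all others here, which is the kind of case where an equilibrium policy is a mixed one.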

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the…
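
One way to see the limitation of reward models mentioned in the abstract is to check what a scalar reward can and cannot express. The sketch below assumes the reward model is of Bradley-Terry type, a common choice that the text above does not name; the function names `bradley_terry_prob` and `cycle_logit_sum` and all numeric values are hypothetical. Under that assumption the log-odds around any cycle of three responses must sum to zero, so cyclic (non-transitive) preferences cannot be reproduced by any reward assignment, whereas a preference model that takes both responses as input has no such constraint.

```python
import math

def bradley_terry_prob(reward_a, reward_b):
    """P(a preferred over b) under a Bradley-Terry model with scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def cycle_logit_sum(p_ab, p_bc, p_ca):
    """Sum of log-odds around the cycle a -> b -> c -> a.

    Under Bradley-Terry this equals (r_a - r_b) + (r_b - r_c) + (r_c - r_a) = 0,
    so a nonzero value means no scalar reward reproduces the three preferences.
    """
    return logit(p_ab) + logit(p_bc) + logit(p_ca)

# Preferences generated from hypothetical rewards: the cycle sum vanishes.
rewards = {"a": 1.0, "b": 0.2, "c": -0.5}
consistent = cycle_logit_sum(
    bradley_terry_prob(rewards["a"], rewards["b"]),
    bradley_terry_prob(rewards["b"], rewards["c"]),
    bradley_terry_prob(rewards["c"], rewards["a"]),
)

# Hypothetical cyclic preferences: a over b, b over c, c over a, each with prob 0.9.
cyclic = cycle_logit_sum(0.9, 0.9, 0.9)

if __name__ == "__main__":
    print("reward-generated cycle sum:", consistent)
    print("cyclic preference cycle sum:", cyclic)
```

The first value is zero up to floating-point error, while the second is strictly positive, so the cyclic preferences fall outside what this assumed reward model can represent.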

Code & Models

Datasets

misovalko/my-research-papers
dataset · 21 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems