Human Alignment of Large Language Models through Online Preference Optimisation

Daniele Calandriello; Daniel Guo; Remi Munos; Mark Rowland; Yunhao Tang; Bernardo Avila Pires; Pierre Harvey Richemond; Charline Le Lan; Michal Valko; Tianqi Liu; Rishabh Joshi; Zeyu Zheng; Bilal Piot

arXiv:2403.08635 · cs.LG · March 14, 2024 · 2 cites

Open Access · 1 dataset

TL;DR

This paper demonstrates the equivalence between two recent human alignment methods for large language models, IPO and Nash-MD, introduces IPO-MD, a generalisation of IPO that uses online preference data, and compares it with existing methods on a summarisation task.

Contribution

The paper proves the equivalence between the IPO and Nash-MD methods, introduces IPO-MD as a new generalised online alignment algorithm, and evaluates it against other online preference optimisation techniques.
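
The two ingredients named above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the squared IPO loss and the geometric-mixture sampling distribution are written here as they are commonly stated in the IPO and Nash-MD papers, and the function names, the parameters `tau` and `beta`, and the toy finite response set are all assumptions introduced for this example.

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared IPO-style loss for one (preferred, dispreferred) response pair.

    Assumed form: the difference of policy-vs-reference log-ratios is
    regressed towards 1 / (2 * tau). All argument names are hypothetical.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def geometric_mixture(policy, ref, beta):
    """Normalised policy^(1 - beta) * ref^beta over a finite set of responses.

    Assumed form of the regularised sampling distribution: beta = 0 recovers
    the current policy, beta = 1 recovers the reference policy.
    """
    weights = [p ** (1.0 - beta) * r ** beta for p, r in zip(policy, ref)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    policy = [0.5, 0.3, 0.2]      # toy current policy over three responses
    reference = [0.2, 0.3, 0.5]   # toy reference policy
    print(geometric_mixture(policy, reference, beta=0.25))
    print(ipo_loss(-1.0, -2.0, -1.5, -1.5, tau=0.5))
```

Under these assumptions, an IPO-MD-style training step would draw the response pair from `geometric_mixture(...)` rather than from a fixed offline dataset, have a preference model label the pair, and then apply `ipo_loss` to it.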

Findings

01 IPO and Nash-MD are equivalent when considering online data.

02 IPO-MD outperforms some existing online preference methods.

03 The equivalence enables new insights into online human alignment strategies.

Abstract

Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the…

Datasets

misovalko/my-research-papers
dataset · 21 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Direct Preference Optimization