Human Alignment of Large Language Models through Online Preference Optimisation
Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

TL;DR
This paper demonstrates the equivalence between two recent human alignment methods for large language models, introduces a generalized algorithm IPO-MD leveraging online preference data, and compares its performance with existing methods on summarization tasks.
Contribution
It proves the equivalence between IPO and Nash-MD methods, introduces IPO-MD as a new generalized online alignment algorithm, and evaluates its effectiveness against other online preference optimization techniques.
Findings
IPO and Nash-MD are equivalent when considering online data.
IPO-MD outperforms some existing online preference methods.
The equivalence enables new insights into online human alignment strategies.
Abstract
Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the…
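To make the two ingredients named in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the commonly stated form of the IPO objective (a squared loss on the policy-versus-reference log-ratio gap with target 1/(2τ)) and assumes that the regularised sampling of Nash-MD is a geometric mixture between the current policy and the reference policy, applied here per token. The function names `ipo_loss` and `geometric_mixture` and the parameters `tau` and `beta` are hypothetical, chosen for illustration only.

```python
import math


def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau):
    """Squared IPO-style loss for one pair where y_w is preferred over y_l.

    Inputs are sequence log-probabilities under the policy (logp_*) and the
    reference policy (ref_logp_*). tau is the regularisation strength
    (assumed notation).
    """
    # Gap between the policy/reference log-ratios of the two completions.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Regress the gap towards the target 1 / (2 * tau).
    return (h - 1.0 / (2.0 * tau)) ** 2


def geometric_mixture(policy_logits, ref_logits, beta):
    """Next-token distribution proportional to pi^(1 - beta) * pi_ref^beta.

    Interpolating logits and renormalising gives the same result as
    interpolating log-probabilities, since the normalisers are constants.
    """
    mixed = [(1.0 - beta) * p + beta * r
             for p, r in zip(policy_logits, ref_logits)]
    m = max(mixed)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in mixed]
    z = sum(exps)
    return [e / z for e in exps]
```

Under these assumptions, `beta = 0` samples from the current policy alone and `beta = 1` samples from the reference policy alone; intermediate values interpolate between the two. How IPO-MD combines the sampled pairs with the loss is not specified in the truncated abstract above, so that step is left out.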
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Direct Preference Optimization
