Generalized Preference Optimization: A Unified Approach to Offline Alignment

Yunhao Tang; Zhaohan Daniel Guo; Zeyu Zheng; Daniele Calandriello; Rémi Munos; Mark Rowland; Pierre Harvey Richemond; Michal Valko; Bernardo Ávila Pires; Bilal Piot

arXiv:2402.05749 · cs.LG · May 30, 2024 · 1 citation

TL;DR

This paper introduces generalized preference optimization (GPO), a unified framework for offline preference-based model fine-tuning that encompasses existing methods and provides new insights into regularization effects and algorithmic trade-offs.

Contribution

The paper proposes GPO, a flexible family of offline preference optimization algorithms that unifies existing methods and yields new variants, and that clarifies how regularization works in offline alignment.

Findings

01 GPO unifies existing offline preference optimization algorithms.

02 Different GPO variants balance regularization and performance similarly.

03 The choice of convex function influences regularization effects and algorithm behavior.

Abstract

Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO, and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In a controlled setting akin to Gao et al. (2023), we also show that different GPO variants achieve…
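
As a rough illustration of the abstract's central idea, a GPO-style loss can be sketched as a convex function applied to a scaled log-ratio margin between the preferred and dispreferred responses. Nothing in the sketch below is taken from this page: the function names, the `beta` scale, and the specific convex functions (logistic for DPO, hinge for SLiC, squared for IPO) are assumptions based on the commonly cited forms of those losses, and the paper itself should be consulted for the exact definitions.

```python
import math


def logistic(x):
    """Assumed DPO-style choice: log(1 + exp(-x)), computed stably."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def hinge(x):
    """Assumed SLiC-style choice: max(0, 1 - x)."""
    return max(0.0, 1.0 - x)


def squared(x):
    """Assumed IPO-style choice: (x - 1)^2."""
    return (x - 1.0) ** 2


def gpo_loss(f, beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Mean of f(beta * rho) over a batch of preference pairs.

    rho is the policy-vs-reference log-ratio of the preferred response (w)
    minus that of the dispreferred response (l). All argument names are
    hypothetical; inputs are per-pair sequence log-probabilities.
    """
    margins = [
        (pw - rw) - (pl - rl)
        for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l)
    ]
    return sum(f(beta * rho) for rho in margins) / len(margins)


if __name__ == "__main__":
    # Toy log-probabilities for two preference pairs (made-up numbers).
    logp_w, logp_l = [-1.0, -2.0], [-3.0, -2.5]
    ref_w, ref_l = [-1.5, -2.0], [-2.5, -2.0]
    for name, f in [("logistic", logistic), ("hinge", hinge), ("squared", squared)]:
        print(name, gpo_loss(f, 0.5, logp_w, logp_l, ref_w, ref_l))
```

Swapping `f` is the only change needed to move between variants in this sketch, which is one way to read the claim that the choice of convex function determines how strongly the loss regularizes the policy toward the reference.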

Datasets

misovalko/my-research-papers
dataset· 21 dl

Taxonomy

Topics: Constraint Satisfaction and Optimization · Multi-Criteria Decision Making

Methods: Direct Preference Optimization