LiPO: Listwise Preference Optimization through Learning-to-Rank

Tianqi Liu; Zhen Qin; Junru Wu; Jiaming Shen; Misha Khalman; Rishabh Joshi; Yao Zhao; Mohammad Saleh; Simon Baumgartner; Jialu Liu; Peter J. Liu; Xuanhui Wang

arXiv:2402.01878·cs.CL·January 28, 2025·1 cites

TL;DR

This paper introduces LiPO, a listwise preference optimization framework for language model alignment that leverages learning-to-rank techniques, demonstrating improved performance over existing pairwise methods like DPO and SLiC.

Contribution

The paper formulates LM alignment as a listwise ranking problem and proposes LiPO-$\lambda$, a novel method that outperforms existing preference optimization approaches.

Findings

1. LiPO-$\lambda$ outperforms DPO variants and SLiC on preference alignment tasks.
2. The listwise approach effectively utilizes ranked response data for better alignment.
3. The study provides a thorough analysis of ranking objectives in LM preference optimization.
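
The listwise idea in the findings above can be illustrated with a small sketch. It is not the authors' implementation: it shows a generic lambda-weighted pairwise logistic loss over one ranked list, in the style of LambdaLoss from the learning-to-rank literature. The function name, the DCG-style gain `2**label - 1`, the `log2` rank discount, and the use of plain numeric scores as input are all assumptions made for illustration.

```python
import numpy as np

def lambda_weighted_listwise_loss(scores, labels):
    """Lambda-weighted pairwise logistic loss over one ranked list.

    scores: one model score per response (hypothetical input; how the
            scores are derived from the policy is left out here).
    labels: graded preference labels, higher means better.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(scores)

    # Rank positions induced by the current scores (1 = highest score).
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(n, dtype=float)
    ranks[order] = np.arange(1, n + 1)

    gains = 2.0 ** labels - 1.0                 # assumed DCG-style gain
    inv_discounts = 1.0 / np.log2(1.0 + ranks)  # assumed DCG-style discount

    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                # Pairs that matter more for the ranking metric get more weight.
                weight = abs(gains[i] - gains[j]) * abs(
                    inv_discounts[i] - inv_discounts[j]
                )
                # log(1 + exp(-(s_i - s_j))), computed in a numerically stable way.
                total += weight * np.logaddexp(0.0, -(scores[i] - scores[j]))
    return float(total)


# Scores that agree with the labels give a lower loss than reversed scores.
well_ordered = lambda_weighted_listwise_loss([2.0, 1.0, 0.0], [2, 1, 0])
reversed_order = lambda_weighted_listwise_loss([0.0, 1.0, 2.0], [2, 1, 0])
print(well_ordered < reversed_order)
```

Unlike a purely pairwise loss, the weight on each pair here depends on where both responses sit in the full list, which is what makes the objective listwise.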

Abstract

Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses, to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, a thorough study of fitting directly on a list of responses is lacking. In this work, we formulate LM alignment as a listwise ranking problem and describe the LiPO framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most…

Code & Models

Repositories

CyberAgentAILab/annotation-efficient-po
pytorch

Videos

LiPO: Listwise Preference Optimization through Learning-to-Rank· underline

Taxonomy

Topics: Data Management and Algorithms

Methods: Direct Preference Optimization