Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling
Yuxuan Tang, Yifan Feng

TL;DR
This paper introduces Ranked Choice Preference Optimization (RCPO), a framework that aligns LLMs using richer human feedback, such as multiway comparisons and top-k rankings, and outperforms traditional pairwise methods.
Contribution
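The paper presents RCPO, a unified framework that brings ranked choice modeling into LLM training via maximum likelihood estimation. It supports both utility-based and rank-based choice models, subsumes pairwise methods such as DPO and SimPO as special cases, and outperforms existing methods.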
Findings
RCPO outperforms baseline methods on multiple LLMs.
Leveraging ranked preferences improves alignment effectiveness.
The framework extends to richer feedback formats, such as multiway comparisons and top-k rankings.
Abstract
Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiway comparisons and top-k rankings. We introduce Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across…
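To make the link between choice modeling and pairwise preference optimization concrete, here is a minimal sketch of a top-k ranking likelihood under a sequential Multinomial Logit (Plackett-Luce style) model. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the DPO-style implicit reward used as the utility, and the value of `beta` are all hypothetical choices made for this example.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (assumed utility): beta * log-ratio of policy to reference."""
    return beta * (logp_policy - logp_ref)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mnl_topk_nll(utilities, ranking):
    """Negative log-likelihood of a top-k ranking under sequential MNL choice.

    utilities: one utility per candidate response.
    ranking:   indices of the top-k responses, best first (k may be < len(utilities)).
    At each step the annotator is modeled as picking the best item among those
    not yet ranked, with probability given by a softmax over their utilities.
    """
    remaining = list(range(len(utilities)))
    nll = 0.0
    for chosen in ranking:
        nll += log_sum_exp([utilities[i] for i in remaining]) - utilities[chosen]
        remaining.remove(chosen)
    return nll


def pairwise_logistic_nll(u_win, u_lose):
    """Pairwise logistic loss: -log sigmoid(u_win - u_lose)."""
    return math.log1p(math.exp(-(u_win - u_lose)))


# Hypothetical sequence log-probabilities for two responses to one prompt.
u = [implicit_reward(-10.0, -12.0), implicit_reward(-11.0, -10.5)]

# With two candidates and a top-1 choice, the MNL loss equals the pairwise loss.
two_way = mnl_topk_nll(u, [0])
pairwise = pairwise_logistic_nll(u[0], u[1])
```

With only two candidates and a single top choice, the ranking loss reduces to the pairwise logistic loss, which is the sense in which a pairwise objective appears as a special case in this sketch. With more candidates, each additional ranked position contributes its own softmax term.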
