Holistic Utility Preference Learning for Listwise Alignment
Jiacong Zhou, Xianyun Wang, Min Zhang, Jun Yu

TL;DR
This paper presents DRPO, a listwise learning-to-rank method using differentiable NDCG to improve alignment of language models with human preferences, outperforming pairwise approaches.
Contribution
The paper introduces DRPO, a novel listwise preference optimization method that leverages holistic list rankings and differentiable NDCG for better alignment.
Findings
DRPO outperforms existing pairwise methods in response quality.
The diffNDCG loss, a differentiable form of Normalized Discounted Cumulative Gain (NDCG), enables end-to-end training against the ranking metric.
Adaptive Rank Policy Score improves response discriminability.
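To make the role of a differentiable NDCG concrete, the sketch below computes standard NDCG and then a smooth surrogate of it. It is illustrative only: the surrogate uses a generic sigmoid-based soft rank, which is an assumption chosen for simplicity and is not the paper's diffNDCG construction. The function names (`dcg`, `ndcg`, `soft_ranks`, `approx_ndcg`) and the `temperature` parameter are hypothetical.

```python
import math


def dcg(ranked_relevances):
    """Discounted cumulative gain for relevance labels given in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
        for pos, rel in enumerate(ranked_relevances)
    )


def ndcg(scores, relevances):
    """Exact NDCG of the ranking obtained by sorting items by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0
    return dcg([relevances[i] for i in order]) / ideal


def soft_ranks(scores, temperature=1.0):
    """Smooth rank surrogate (an assumption, not the paper's method):
    rank_i = 1 + sum over j != i of sigmoid((s_j - s_i) / temperature)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return [
        1.0 + sum(
            sigmoid((s_j - s_i) / temperature)
            for j, s_j in enumerate(scores)
            if j != i
        )
        for i, s_i in enumerate(scores)
    ]


def approx_ndcg(scores, relevances, temperature=1.0):
    """NDCG with hard ranks replaced by soft ranks, so it varies smoothly with scores."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0
    ranks = soft_ranks(scores, temperature)
    gain = sum(
        (2 ** rel - 1) / math.log2(1.0 + rank)
        for rel, rank in zip(relevances, ranks)
    )
    return gain / ideal


if __name__ == "__main__":
    model_scores = [0.2, 1.5, 0.9]   # hypothetical model scores for three responses
    human_labels = [0, 2, 1]         # hypothetical graded preference labels
    print(ndcg(model_scores, human_labels))
    print(approx_ndcg(model_scores, human_labels, temperature=0.01))
```

Exact NDCG depends on the scores only through a sort, so it is piecewise constant and gives no useful gradient. The soft-rank version changes smoothly with the scores and approaches the exact value as `temperature` shrinks, which is the general idea behind training directly against a ranking metric.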
Abstract
Aligning large language models with human preferences is essential for improving interaction quality and safety by ensuring outputs better reflect human values. A promising strategy involves Reinforcement Learning from Human Feedback (RLHF), starting with collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Existing methods such as Direct Preference Optimization (DPO) focus on pairwise comparisons, categorizing responses into preferred and less preferred pairs and optimizing pairwise margins. However, this pairwise approach cannot capture the holistic ranking relationships among multiple responses or effectively leverage the rich preference information available in list-wise comparisons. To address this challenge, this paper introduces Direct Ranking Preference Optimization (DRPO), a novel…
Taxonomy
Topics: Data Management and Algorithms
Methods: Focus
