MallowsPO: Fine-Tune Your LLM with Preference Dispersions

Haoxian Chen; Hanyang Zhao; Henry Lam; David Yao; Wenpin Tang

arXiv:2405.14953 · cs.LG · April 21, 2025 · 1 cites

TL;DR

MallowsPO introduces a dispersion index based on Mallows' preference ranking theory to better characterize human preference diversity, improving fine-tuning of large language models beyond existing methods.

Contribution

The paper develops MallowsPO, a novel preference optimization approach that unifies and enhances DPO by incorporating preference dispersion, leading to improved LLM fine-tuning performance.
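As a rough illustration of what folding a dispersion term into DPO could look like, the sketch below computes the standard per-example DPO loss alongside a variant whose margin is multiplied by a per-prompt factor. The function names and the `dispersion_scale` parameter are hypothetical, and this is not claimed to be the paper's exact objective. It is only meant to show that a scale of 1 recovers plain DPO.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-example DPO loss.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def dispersion_scaled_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                               dispersion_scale=1.0, beta=0.1):
    """Illustrative variant: the DPO margin is rescaled per prompt.

    `dispersion_scale` is a hypothetical stand-in for a prompt-dependent
    dispersion term; how the paper actually defines and estimates it is
    not stated in this summary.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(dispersion_scale * margin)
```

With `dispersion_scale=1.0` the two functions agree. A larger scale sharpens the loss on prompts where the margin is already positive, and a smaller scale flattens it.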

Findings

01. MallowsPO improves performance on benchmark tasks.

02. It boosts the length-controlled (LC) win rate by nearly 2% when combined with state-of-the-art (SOTA) methods.

03. It maintains strong generalization across diverse tasks.

Abstract

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its inability to characterize the diversity of human preferences. Inspired by Mallows' theory of preference ranking, we develop in this paper a new approach, MallowsPO. A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preferences across prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, and are thus unified with MallowsPO. More importantly, we demonstrate (empirically) how to use this dispersion index to enhance the performance of DPO in a broad array of benchmark tasks, from synthetic bandit selection to controllable generations and dialogues,…
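For readers unfamiliar with Mallows' ranking theory, the following minimal sketch shows the classical Kendall-tau form of the Mallows model, in which a ranking's probability decays as `phi ** distance` from a central ranking. The helper names are hypothetical, and the summary above does not say which distance or parameterization the paper uses, so this should be read as background rather than as the paper's model.

```python
from itertools import permutations


def kendall_tau_distance(a, b):
    """Number of item pairs that rankings a and b order differently."""
    position_in_b = {item: i for i, item in enumerate(b)}
    seq = [position_in_b[item] for item in a]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def mallows_pmf(center, phi):
    """Probability of every ranking under a Mallows model.

    center: the central (modal) ranking, as a tuple of items.
    phi: dispersion in (0, 1]; values near 0 concentrate mass on `center`,
         and phi = 1 gives the uniform distribution over rankings.
    """
    weights = {
        perm: phi ** kendall_tau_distance(perm, center)
        for perm in permutations(center)
    }
    normalizer = sum(weights.values())
    return {perm: w / normalizer for perm, w in weights.items()}
```

For two responses the model reduces to a single number: the central order has probability `1 / (1 + phi)`. A low-dispersion prompt therefore yields near-unanimous preferences, and a high-dispersion prompt yields close to a coin flip.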

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. This paper is clearly written and well structured. The authors provide sufficient analytical and empirical comparisons with previous methods.
2. The contribution to the methodology for fine-tuning large language models is significant. This work proposes a novel preference optimization framework based on Mallows' preference ranking theory beyond the BT setting, which provides a novel perspective on preference modeling. This new method introduces an important component, the dispersion index, t

Weaknesses

None.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

This paper presents an extension to DPO with theoretical guarantees and demonstrates practical improvements in LLM fine-tuning on a set of benchmark tasks. The motivation is clearly justified and the studied setting is of practical interest. The introduction of preference dispersion through Mallows’ ranking theory is an interesting addition to preference optimization and is presented clearly. The paper could, however, benefit from additional clarity in Section 3.1, where the transition from tradit

Weaknesses

The use of a dispersion index adds interpretability to preference-based tuning, yet further details on how practitioners could leverage these insights in real-world applications could enhance the paper’s practical relevance. While the empirical results are promising, additional insights into the MallowsPO model’s performance under varying preference dispersion scenarios (e.g., low vs. high dispersion prompts) could strengthen the evaluation. This might highlight more specific cases where Mallow

Reviewer 03 · Rating 8 · Confidence 3

Strengths

I believe the paper has the following main strengths:
- The problem studied (i.e., LLM alignment to preference data) is very relevant
- The connection with Mallows' ranking models is, to the best of my knowledge, novel and meaningful
- The derived approach is sound
- Experiments are somewhat extensive and demonstrate the benefit of the introduced approach

Weaknesses

Weaknesses are:
- The authors could do a better job at illustrating qualitative differences between policies trained with DPO and MallowsPO. In particular, one could plot entropy during training, number of tokens, distribution of rewards (when available) to make their case stronger and to understand what are the concrete differences between the two approaches. Part of this is left as future work, but I believe it would be nice to discuss/analyze it in the current paper
- The discussion and esti

Videos

MallowsPO: Fine-Tune Your LLM with Preference Dispersions · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Rough Sets and Fuzzy Logic

Methods: Direct Preference Optimization