VPO: Leveraging the Number of Votes in Preference Optimization
Jae Hyeon Cho, Minkyung Park, Byung-Jun Lee

TL;DR
This paper introduces VPO, a novel preference optimization framework that leverages vote counts in human preference data to improve language model training, outperforming existing methods.
Contribution
The paper proposes VPO, a method that applies Bayesian minimum mean squared error (MMSE) estimation to vote counts in order to model preferences more accurately and improve language model training.
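To make the idea of a Bayesian MMSE estimate from vote counts concrete, here is a minimal sketch. It assumes a Beta prior on the probability that one response is preferred, with votes treated as independent Bernoulli draws. The summary above does not state the prior, so the function name `mmse_preference` and the parameters `alpha` and `beta` are hypothetical. Under squared-error loss the MMSE estimator is the posterior mean, which has a closed form for this conjugate pair.

```python
def mmse_preference(votes_a: int, votes_b: int,
                    alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior-mean estimate of P(response a is preferred over response b).

    Assumption (not taken from the paper summary): a Beta(alpha, beta) prior
    and independent votes. The posterior is then
    Beta(alpha + votes_a, beta + votes_b), and its mean is the MMSE estimate.
    """
    if votes_a < 0 or votes_b < 0:
        raise ValueError("vote counts must be non-negative")
    if alpha <= 0 or beta <= 0:
        raise ValueError("prior parameters must be positive")
    return (votes_a + alpha) / (votes_a + votes_b + alpha + beta)


if __name__ == "__main__":
    # A clear majority moves the estimate toward 1, but never all the way.
    print(mmse_preference(9, 1))
    # A split vote stays near 0.5, marking the pair as controversial.
    print(mmse_preference(5, 5))
    # A single vote is weaker evidence than many unanimous votes.
    print(mmse_preference(1, 0), mmse_preference(100, 0))
```

With this kind of estimator, a pair won 1 to 0 and a pair won 100 to 0 receive different targets, which is the distinction a plain winner/loser label discards.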
Findings
VPO outperforms existing preference optimization algorithms.
Vote count information improves preference modeling accuracy.
Extensions of DPO and IPO with VPO yield better results.
Abstract
Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean…
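As an illustration of how a vote-derived probability could enter a DPO-style objective, the sketch below computes a per-pair loss with a soft target. It is an assumption-laden sketch, not the paper's stated objective: the function `soft_dpo_loss`, its argument names, and the cross-entropy form with a soft label are hypothetical choices. With `target=1.0` it reduces to the standard DPO loss on the implicit reward margin.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def soft_dpo_loss(policy_logp_a: float, ref_logp_a: float,
                  policy_logp_b: float, ref_logp_b: float,
                  target: float = 1.0, beta: float = 0.1) -> float:
    """Cross-entropy between a soft preference target and the DPO preference model.

    `target` is the assumed probability that response a is preferred
    (for example, an estimate derived from vote counts). The inputs are
    sequence log-probabilities under the policy and the frozen reference model.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must lie in [0, 1]")
    # Implicit reward margin between the two responses, scaled by beta.
    margin = beta * ((policy_logp_a - ref_logp_a) - (policy_logp_b - ref_logp_b))
    return -(target * log_sigmoid(margin) + (1.0 - target) * log_sigmoid(-margin))


if __name__ == "__main__":
    # Hard label: standard DPO loss for one pair.
    print(soft_dpo_loss(-1.0, -2.0, -3.0, -2.0, target=1.0))
    # Softer label for a contested pair: a large margin is no longer rewarded without limit.
    print(soft_dpo_loss(-1.0, -2.0, -3.0, -2.0, target=0.6))
```

A soft target below 1 gives the loss a finite minimizer in the margin, so contested pairs stop pushing the policy once its implied preference matches the target.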
Taxonomy
Topics: Data Management and Algorithms
Methods: ALIGN · Balanced Selection · Direct Preference Optimization
