VPO: Leveraging the Number of Votes in Preference Optimization

Jae Hyeon Cho; Minkyung Park; Byung-Jun Lee

arXiv:2410.22891 · cs.LG · October 31, 2024

TL;DR

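This paper introduces VPO, a preference optimization framework that uses the vote counts in human preference data to distinguish clearly preferred pairs from controversial ones when training language models. The authors report that it outperforms existing preference optimization algorithms.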

Contribution

The paper proposes VPO, a method that combines vote counts with Bayesian minimum mean squared error (MMSE) estimation to model preferences more accurately and improve language model training.
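To make the idea of a Bayesian MMSE estimate from vote counts concrete, here is a minimal sketch. It relies on the standard fact that the MMSE estimator is the posterior mean, and it assumes a Beta prior over the probability that one response is preferred. The function name `mmse_preference` and the prior parameters `prior_a` and `prior_b` are illustrative assumptions, not taken from the paper or its repository.

```python
def mmse_preference(votes_a: int, votes_b: int,
                    prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior-mean estimate of P(response A is preferred over response B).

    Assumes a Beta(prior_a, prior_b) prior and independent votes, so the
    posterior is Beta(votes_a + prior_a, votes_b + prior_b). Its mean is the
    Bayesian MMSE estimate. The prior choice here is a hypothetical default.
    """
    if votes_a < 0 or votes_b < 0:
        raise ValueError("vote counts must be non-negative")
    if prior_a <= 0 or prior_b <= 0:
        raise ValueError("Beta prior parameters must be positive")
    return (votes_a + prior_a) / (votes_a + votes_b + prior_a + prior_b)


# A lopsided vote gives a confident estimate; a close vote stays near 0.5.
clear_pair = mmse_preference(9, 1)
contested_pair = mmse_preference(3, 2)
single_vote = mmse_preference(1, 0)
```

Under this sketch, a single 1-0 vote yields a less extreme estimate than a 9-1 vote, which is one way vote counts can separate clearly preferred pairs from weakly supported ones.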

Findings

01. VPO outperforms existing preference optimization algorithms.

02. Vote count information improves preference modeling accuracy.

03. Extensions of DPO and IPO with VPO yield better results.

Abstract

Direct Preference Optimization (DPO) trains a language model using human preference data, bypassing the explicit reward modeling phase of Reinforcement Learning from Human Feedback (RLHF). By iterating over sentence pairs in a preference dataset, DPO enhances generation quality by increasing the likelihood of producing preferred sentences over less favored ones. Preference datasets are typically created by selecting preferred sentences through a voting process involving multiple individuals, as opinions can vary due to the subjective nature of human preferences. While the number of votes offers insight into whether a sentence pair is clearly preferable or controversial, current methods do not fully leverage this information. In this paper, we introduce a technique that leverages user voting data to better align with diverse subjective preferences. We employ the Bayesian Minimum Mean…
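As a rough illustration of how DPO raises the likelihood of the preferred response relative to the less favored one, the sketch below computes the per-pair DPO loss from sequence log-probabilities. The `target` argument is an assumption added for illustration: with `target=1.0` the function reduces to the standard DPO loss, while a value below 1 treats the label as a soft preference probability. This is not confirmed to be the paper's exact objective, and all function and parameter names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(policy_logp_w: float, policy_logp_l: float,
               ref_logp_w: float, ref_logp_l: float,
               beta: float = 0.1) -> float:
    """Scaled difference of policy-vs-reference log-ratios (winner minus loser)."""
    return beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))


def soft_dpo_loss(margin: float, target: float = 1.0) -> float:
    """Binary cross-entropy between sigmoid(margin) and a preference target.

    target=1.0 gives the standard DPO loss, -log(sigmoid(margin)).
    target<1.0 is an illustrative soft-label variant (an assumption here).
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be in [0, 1]")
    return -(target * log_sigmoid(margin)
             + (1.0 - target) * log_sigmoid(-margin))


# Hypothetical log-probabilities: the policy already favors the winner
# slightly more than the reference model does.
margin = dpo_margin(-12.0, -15.0, -13.0, -14.0, beta=0.1)
hard_label_loss = soft_dpo_loss(margin)             # standard DPO
soft_label_loss = soft_dpo_loss(margin, target=0.6)  # controversial pair
```

With a hard label the loss keeps decreasing as the margin grows, whereas with a soft target it is minimized at a finite margin, so a controversial pair would not push the model toward an extreme preference.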

Code & Models

Repositories

ku-dmlab/vpo (PyTorch, official)

Taxonomy

Topics: Data Management and Algorithms

Methods: ALIGN · Balanced Selection · Direct Preference Optimization