Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Hamish Ivison; Yizhong Wang; Jiacheng Liu; Zeqiu Wu; Valentina Pyatkin; Nathan Lambert; Noah A. Smith; Yejin Choi; Hannaneh Hajishirzi

arXiv:2406.09279·cs.CL·October 10, 2024·3 cites

TL;DR

This paper systematically investigates how four components of preference-based learning (preference data, learning algorithm, reward model, and policy training prompts) affect language model performance, and offers practical recommendations backed by empirical results.

Contribution

It provides a comprehensive analysis of preference learning components, compares DPO and PPO, and offers a recipe for effective preference-based training of language models.
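For readers unfamiliar with the DPO side of the comparison, the following is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python. The function name, parameter names, and the default `beta` value are illustrative assumptions, not taken from this paper, and the sketch is not the authors' training code. PPO, by contrast, first trains a separate reward model and then optimizes the policy online against it, which is not shown here.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. All names and the
    default beta are hypothetical, chosen for illustration only.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# When policy and reference agree, the loss equals log(2).
baseline = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)
# When the policy favors the chosen response more than the reference does,
# the loss drops below that baseline.
improved = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0, beta=1.0)
```

The loss depends only on log-probability differences, so no explicit reward model is needed; this is the main structural difference from PPO-based training.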

Findings

01. Preference data quality has the largest impact on performance.

02. PPO outperforms DPO by up to 2.5% in math domains.

03. High-quality preference data improves instruction following by up to 8%.

Abstract

Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and…

Taxonomy

Topics: Bayesian Modeling and Causal Inference

Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization