Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi

TL;DR
This paper systematically investigates how four components of preference-based learning (preference data, learning algorithm, reward model, and policy training prompts) affect language model performance, and it offers practical recommendations backed by empirical results.
Contribution
The paper analyzes the components of preference-based learning, compares DPO and PPO, and offers a recipe for effective preference-based training of language models.
Findings
Preference data quality has the largest impact on performance.
PPO outperforms DPO by up to 2.5% in math domains.
High-quality preference data improves instruction following by up to 8%.
Abstract
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and…
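To make the "learning algorithm" component concrete, here is a minimal sketch of the per-example DPO objective in its commonly cited form (the negative log-sigmoid of a scaled log-ratio margin between the chosen and rejected responses). It is an illustration under assumptions, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical choices, and the inputs are assumed to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (hypothetical helper; beta=0.1 is an assumed default).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin: positive when the policy favors the chosen response
    # more strongly than the reference model does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

When the policy matches the reference, the loss equals log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference. PPO, by contrast, trains against scores from a separate reward model on sampled generations, which is why the reward model and the policy training prompts appear as distinct components in the abstract.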
Code & Models
- 🤗 allenai/tulu-v2.5-dpo-13b-uf-mean · model · 20 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-argilla-orca-pairs · model · 12 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-helpsteer · model · 16 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-shp2 · model · 16 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-stackexchange · model · 11 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-uf-overall · model · 15 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-capybara · model · 13 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-prm-phase-2 · model · 16 downloads
- 🤗 allenai/tulu-v2.5-dpo-13b-hh-rlhf · model · 24 downloads · ♡ 1
- 🤗 allenai/tulu-v2.5-dpo-13b-nectar · model · 12 downloads
Taxonomy
Topics: Bayesian Modeling and Causal Inference
Methods: Direct Preference Optimization · Entropy Regularization · Proximal Policy Optimization
