Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

TL;DR
This paper introduces DPO-Positive, a new loss function that addresses failure modes in preference-based fine-tuning of large language models, leading to improved performance across various tasks and datasets.
Contribution
The authors identify a failure mode in standard DPO and propose DPO-Positive, a novel training method that outperforms DPO and other fine-tuning techniques on multiple benchmarks.
Findings
DPO can reduce the likelihood of preferred examples during training.
DPO-Positive mitigates this issue and improves downstream task performance.
Smaug-72B surpasses an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
Abstract
Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP…
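The failure mode described in the abstract can be illustrated with a small numeric sketch. The code below is not the authors' implementation: it assumes sequence-level log-probabilities are already available as plain floats, and the penalty in `dpop_loss` reflects our reading of the DPOP objective (a term that activates only when the policy assigns the preferred completion a lower log-probability than the reference model does). The function names, `beta`, `lam`, and all numeric values are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    pi_w / pi_l: policy log-probs of the preferred / dispreferred completion.
    ref_w / ref_l: the same quantities under the frozen reference model.
    """
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPOP-style loss (sketch): the DPO margin minus a penalty that is
    non-zero only when the policy's log-prob of the preferred completion
    drops below the reference's. `lam` is an assumed penalty weight."""
    penalty = max(0.0, ref_w - pi_w)
    margin = (pi_w - ref_w) - (pi_l - ref_l) - lam * penalty
    return -log_sigmoid(beta * margin)


# Hypothetical log-probs: the policy starts equal to the reference ...
ref_w, ref_l = -10.0, -11.0
before = (ref_w, ref_l)
# ... then lowers BOTH completions, the dispreferred one by much more.
after = (-12.0, -20.0)

dpo_before = dpo_loss(*before, ref_w, ref_l)
dpo_after = dpo_loss(*after, ref_w, ref_l)
dpop_before = dpop_loss(*before, ref_w, ref_l)
dpop_after = dpop_loss(*after, ref_w, ref_l)

print(f"DPO : {dpo_before:.4f} -> {dpo_after:.4f}")
print(f"DPOP: {dpop_before:.4f} -> {dpop_after:.4f}")
```

In this toy setting the DPO loss falls even though the preferred completion has become less likely, because only the relative margin matters. The penalised variant instead rises above its starting value, so the same update would be discouraged.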
Code & Models
- 🤗 speakleash/Bielik-Minitron-7B-v3.0-Instruct · model · 3.7k dl · ♡ 17
- 🤗 abacusai/Smaug-34B-v0.1 · model · 8.3k dl · ♡ 64
- 🤗 abacusai/Smaug-72B-v0.1 · model · 8.0k dl · ♡ 467
- 🤗 speakleash/Bielik-1.5B-v3.0-Instruct · model · 1.1k dl · ♡ 14
- 🤗 speakleash/Bielik-4.5B-v3.0-Instruct · model · 1.1k dl · ♡ 29
- 🤗 abacusai/Smaug-Mixtral-v0.1 · model · 8.3k dl · ♡ 12
- 🤗 blockblockblock/Smaug-72B-v0.1-bpw2.5 · model · 5 dl
- 🤗 blockblockblock/Smaug-72B-v0.1-bpw3 · model · 9 dl
- 🤗 blockblockblock/Smaug-72B-v0.1-bpw3.5 · model · 3 dl
- 🤗 blockblockblock/Smaug-72B-v0.1-bpw3.7 · model · 3 dl
Taxonomy
Topics: Decision-Making and Behavioral Economics
Methods: Direct Preference Optimization
