Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Pietro Bernardelle; Gianluca Demartini

arXiv:2410.16586·cs.AI·October 23, 2024

TL;DR

This paper evaluates how Direct Preference Optimization (DPO) can efficiently fine-tune Large Language Models with limited preference data, highlighting the importance of data diversity and prompt type for improved alignment.

Contribution

The paper systematically compares the data efficiency of DPO across several preference datasets and prompt types, offering guidance on how to make the best use of preference data when fine-tuning LLMs.

Findings

1. Increasing preference data improves model performance.

2. Diverse datasets enhance effectiveness.

3. Conversational prompts outperform question-answer prompts.

Abstract

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant advancements in LLM alignment techniques, the impact of different types of preference data on model performance has yet to be systematically explored. In this study, we investigate the scalability, data efficiency, and effectiveness of Direct Preference Optimization (DPO) in fine-tuning pre-trained LLMs, aiming to reduce their dependency on extensive amounts of preference data, which is expensive to collect. We (1) systematically compare the performance of models fine-tuned with varying percentages of a combined preference judgement dataset to define the improvement curve of DPO and assess its effectiveness in data-constrained environments; and (2)…
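To make the setup in the abstract concrete, the following is a minimal sketch of the two ingredients it mentions: the standard per-pair DPO loss, and drawing a fixed fraction of a preference dataset for a data-efficiency run. It is not the authors' code. The function names `dpo_loss` and `subsample`, the default `beta=0.1`, and the fraction used in the usage note are all hypothetical choices made for illustration; the percentages the paper actually uses are not stated in this excerpt.

```python
import math
import random


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model. `beta` is an
    assumed default, not a value taken from the paper.
    """
    # How much more the policy favours the chosen response over the
    # rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def subsample(pairs, fraction, seed=0):
    """Return a random subset holding `fraction` of the preference pairs.

    Hypothetical helper: a fixed seed keeps the subset reproducible.
    """
    rng = random.Random(seed)
    k = round(len(pairs) * fraction)
    return rng.sample(pairs, k)
```

For example, `subsample(pairs, 0.2)` would select 20% of the pairs for one point on an improvement curve, and `dpo_loss` would then be averaged over the selected pairs during training. When the policy and reference assign identical log-probabilities, the loss equals log 2, and it falls below that as the policy comes to prefer the chosen response more strongly than the reference does.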

Taxonomy

Methods: Direct Preference Optimization