Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective
Pietro Bernardelle, Gianluca Demartini

TL;DR
This paper evaluates how Direct Preference Optimization (DPO) can efficiently fine-tune Large Language Models with limited preference data, highlighting the importance of data diversity and prompt type for improved alignment.
Contribution
It systematically compares the data efficiency of DPO across various preference datasets and prompt types, providing insights into how best to use preference data in LLM fine-tuning.
Findings
Increasing the amount of preference data improves model performance.
Diverse datasets enhance effectiveness.
Conversational prompts outperform question-answer prompts.
Abstract
Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning from human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant advancements in LLM alignment techniques, the impact of different types of preference data on model performance has yet to be systematically explored. In this study, we investigate the scalability, data efficiency, and effectiveness of Direct Preference Optimization (DPO) in fine-tuning pre-trained LLMs, aiming to reduce their dependency on extensive amounts of preference data, which is expensive to collect. We (1) systematically compare the performance of models fine-tuned with varying percentages of a combined preference judgement dataset to define the improvement curve of DPO and assess its effectiveness in data-constrained environments; and (2)…
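As a rough illustration of step (1), the sketch below pairs the standard per-pair DPO loss with a helper that draws a fixed fraction of a preference dataset. It is a minimal sketch, not the authors' code: the function names `dpo_loss` and `subsample`, the default `beta=0.1`, and the fractions used in the usage lines are all assumptions made for illustration, and the log-probabilities are plain floats rather than model outputs.

```python
import math
import random


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. beta=0.1 is an assumed default.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def subsample(pairs, fraction, seed=0):
    """Return a random subset holding `fraction` of the preference pairs."""
    rng = random.Random(seed)
    k = round(len(pairs) * fraction)
    return rng.sample(pairs, k)


# Hypothetical usage: placeholder pairs and illustrative fractions only.
pairs = list(range(1000))
for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
    subset = subsample(pairs, fraction)
    print(fraction, len(subset))
```

With identical policy and reference log-probabilities the loss equals log 2, and it falls as the policy assigns relatively more probability to the chosen response than the reference does. Fine-tuning one model per subset and plotting its evaluation score against the fraction would give an improvement curve of the kind the abstract describes.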
Taxonomy
Methods: Direct Preference Optimization
