Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

Pulkit Pattnaik; Rishabh Maheshwary; Kelechi Ogueji; Vikas Yadav; Sathwik Tejaswi Madhusudhan

arXiv:2403.07230·cs.CL·November 11, 2024·1 cites

Open Access 1 Datasets

TL;DR

Curry-DPO introduces a curriculum learning approach to preference-based training of large language models, utilizing multiple preference pairs per prompt to improve alignment with human preferences, resulting in significant performance gains.

Contribution

It systematically incorporates multiple preference pairs into DPO training using curriculum learning, enhancing model alignment beyond standard single-pair methods.
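For reference, the per-pair objective that standard DPO minimizes can be sketched as follows. This is the widely used DPO loss written in plain Python, not code from the paper; the function name `dpo_pair_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and the
    frozen reference model. beta=0.1 is an illustrative value (assumption).
    """
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With several preference pairs per prompt, a loss of this form would simply be evaluated once per pair; the order in which those pairs are presented is what the curriculum controls.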

Findings

01

Curry-DPO outperforms standard DPO on multiple benchmarks.

02

Achieves a score of 7.43 on MT-bench with Zephyr-7B.

03

Shows up to 7.5% improvement in win rates over standard DPO.

Abstract

Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and one rejected response per user prompt) to align LLMs to human preferences. In practice, multiple responses of varying relative quality can exist for a given prompt. When quality ratings are available for these responses, we propose using them to create multiple preference pairs per prompt. Our work focuses on systematically using the constructed preference pairs in DPO training via a curriculum learning methodology. In particular, we order these preference pairs from easy to hard (emulating curriculum training) according to various criteria. We show detailed comparisons of our proposed approach to the standard single-pair DPO setting. Our method, which we call Curry-DPO, consistently shows…
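The pair construction and easy-to-hard ordering described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it forms every pair of differently rated responses, and it treats a larger rating gap as "easier", which is only one of the possible ordering criteria the abstract alludes to. The names `build_preference_pairs` and `order_easy_to_hard` and the toy ratings are hypothetical.

```python
from itertools import combinations


def build_preference_pairs(responses):
    """Build (chosen, rejected) pairs from rated responses to one prompt.

    responses: list of (text, rating) tuples. Assumption: every pair of
    responses with different ratings yields one preference pair.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(responses, 2):
        if score_a == score_b:
            continue  # equal ratings carry no preference signal
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({
            "chosen": chosen,
            "rejected": rejected,
            "gap": abs(score_a - score_b),
        })
    return pairs


def order_easy_to_hard(pairs):
    """Assumption: a larger rating gap marks an easier pair, so sort by
    descending gap. sorted() is stable, so ties keep construction order."""
    return sorted(pairs, key=lambda pair: pair["gap"], reverse=True)


# Toy example with made-up ratings for four responses to one prompt.
responses = [("A", 9.0), ("B", 7.0), ("C", 4.0), ("D", 4.0)]
curriculum = order_easy_to_hard(build_preference_pairs(responses))
```

In this toy example the widest-gap pairs (A over C, A over D) come first and the closest pair (A over B) comes last, so training would see clear-cut preferences before subtle ones.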

Code & Models

Datasets

ServiceNow-AI/Curriculum_DPO_preferences
dataset · 35 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Direct Preference Optimization · ALIGN