Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Biqing Qi; Pengfei Li; Fangyuan Li; Junqi Gao; Kaiyan Zhang; Bowen Zhou

arXiv:2406.05534·cs.AI·June 11, 2024

Open Access

TL;DR

This paper introduces OFS-DPO, a novel online preference optimization method inspired by species competition, which enhances large language model alignment and mitigates catastrophic forgetting across domains.

Contribution

The paper proposes a fast-slow chasing framework built on two LoRA modules that are optimized at different speeds, together with a new regularization term, to improve continual preference alignment of LLMs across multiple domains.

Findings

01 OFS-DPO outperforms traditional DPO on in-domain tasks.

02 COFS-DPO achieves superior cross-domain continual learning.

03 The method effectively mitigates catastrophic forgetting.

Abstract

Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of cross-domain human preferences, direct continual training can lead to catastrophic forgetting, limiting DPO's performance and efficiency. Inspired by intraspecific competition driving species evolution, we propose an Online Fast-Slow chasing DPO (OFS-DPO) for preference alignment, simulating competition through fast and slow chasing among models to facilitate rapid adaptation. Specifically, we first derive the regret upper bound for online learning, validating our motivation with a min-max optimization pattern. Based on this, we introduce two identical modules using Low-Rank Adaptation (LoRA) with different optimization speeds to simulate intraspecific…
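
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard per-example DPO loss that the abstract builds on, not the OFS-DPO objective itself, which the truncated abstract does not specify. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    All names and the default beta are assumptions for illustration.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Shifting probability mass toward the chosen response lowers the loss.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
print(baseline, improved)
```

Because the loss depends only on log-probability ratios against the reference model, no separate reward model is needed, which is the property the abstract refers to.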

Taxonomy

Topics: Auction Theory and Applications · Data Management and Algorithms · Optimization and Search Problems

Methods: Direct Preference Optimization