Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing
Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

TL;DR
This paper introduces OFS-DPO, a novel online preference optimization method inspired by species competition, which enhances large language model alignment and mitigates catastrophic forgetting across domains.
Contribution
It proposes a fast-slow chasing framework with LoRA modules and a new regularization, improving continual preference alignment in LLMs across multiple domains.
Findings
OFS-DPO outperforms standard DPO on in-domain alignment tasks.
COFS-DPO, the cross-domain variant of OFS-DPO, achieves superior cross-domain continual learning.
The method effectively mitigates catastrophic forgetting.
Abstract
Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of cross-domain human preferences, direct continual training can lead to catastrophic forgetting, limiting DPO's performance and efficiency. Inspired by intraspecific competition driving species evolution, we propose an Online Fast-Slow chasing DPO (OFS-DPO) for preference alignment, simulating competition through fast and slow chasing among models to facilitate rapid adaptation. Specifically, we first derive the regret upper bound for online learning, validating our motivation with a min-max optimization pattern. Based on this, we introduce two identical modules using Low-Rank Adaptation (LoRA) with different optimization speeds to simulate intraspecific…
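For readers unfamiliar with the baseline that the abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the OFS-DPO objective, whose details are cut off in the abstract above. The function name dpo_loss, its argument names, and the default beta=0.1 are assumptions made for illustration; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Hypothetical helper for illustration only; beta=0.1 is an assumed default.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == softplus(-z), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen log-probability and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere in the computation, which is the property the abstract refers to: the preference signal enters only through the log-probability ratios against the reference model.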
Taxonomy
Topics: Auction Theory and Applications · Data Management and Algorithms · Optimization and Search Problems
Methods: Direct Preference Optimization
