The Importance of Online Data: Understanding Preference Fine-tuning via Coverage

Yuda Song; Gokul Swamy; Aarti Singh; J. Andrew Bagnell; Wen Sun

arXiv:2406.01462·cs.LG·July 17, 2024

TL;DR

This paper analyzes the differences between online reinforcement learning and offline contrastive methods for preference fine-tuning of large language models, emphasizing the role of dataset coverage and proposing a hybrid optimization algorithm.

Contribution

It provides a theoretical analysis of coverage conditions affecting convergence of preference fine-tuning methods and introduces HyPO, a hybrid approach combining offline and online data.
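As a rough illustration of what combining offline and online data can look like, the sketch below computes the standard DPO loss on one offline preference pair and adds a KL-style penalty estimated from responses sampled from the current policy. The function names, the weight `lam`, and the way the two terms are summed are assumptions made for illustration; they are not taken from the paper's specification of HyPO.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def online_kl_estimate(logps, ref_logps):
    """Monte Carlo estimate of KL(pi || pi_ref) from responses sampled from pi."""
    return sum(lp - rlp for lp, rlp in zip(logps, ref_logps)) / len(logps)

def hybrid_loss(offline_pair, online_samples, beta=0.1, lam=0.05):
    """Hypothetical combination: offline DPO term plus an online KL penalty."""
    logps, ref_logps = online_samples
    return dpo_loss(*offline_pair, beta=beta) + lam * online_kl_estimate(logps, ref_logps)

# Offline pair: (logp chosen, logp rejected, ref logp chosen, ref logp rejected).
pair = (-4.0, -6.0, -5.0, -5.0)
# Online samples: log-probs of sampled responses under the policy and the reference.
samples = ([-3.0, -4.0], [-3.5, -4.5])
loss = hybrid_loss(pair, samples)
```

The offline term only needs the fixed preference dataset, while the KL term needs fresh samples from the policy being trained but no preference labels for them.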

Findings

01. Offline contrastive methods such as DPO require global coverage to converge to the optimal policy.

02. Online RL methods need only a weaker, partial coverage condition, which helps explain why they can perform better when the offline preference data is limited.

03. HyPO outperforms the purely offline method DPO in the paper's experiments.

Abstract

Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work due to the fact that both have to start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the…
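To make the notion of coverage concrete, the toy sketch below uses one common formalization from the RL literature: the largest density ratio between a target policy and the data-collecting distribution over a finite set of responses. The distributions and the helper name `coverage_coefficient` are hypothetical, and the paper's formal coverage conditions may differ in detail.

```python
import math

def coverage_coefficient(target, data_dist):
    """Return max over y of target(y) / data_dist(y).

    The result is infinite when the target puts mass on a response
    that never appears under the data distribution.
    """
    worst = 0.0
    for y, p in target.items():
        if p == 0.0:
            continue  # responses the target never produces do not matter
        q = data_dist.get(y, 0.0)
        if q == 0.0:
            return math.inf
        worst = max(worst, p / q)
    return worst

# Hypothetical offline data distribution over three candidate responses.
data = {"a": 0.5, "b": 0.5, "c": 0.0}
# A policy that stays on responses present in the data.
near = {"a": 0.9, "b": 0.1, "c": 0.0}
# A policy that concentrates on a response the data never contains.
far = {"a": 0.0, "b": 0.0, "c": 1.0}

near_cov = coverage_coefficient(near, data)
far_cov = coverage_coefficient(far, data)
```

Here `near` is covered with a finite ratio, while `far` has an infinite ratio. A global condition would require a finite ratio for every policy under consideration, whereas a partial condition restricts attention to a smaller set of policies.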

Taxonomy

Topics: Recommender Systems and Techniques · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications

Methods: Direct Preference Optimization