The Importance of Online Data: Understanding Preference Fine-tuning via Coverage
Yuda Song, Gokul Swamy, Aarti Singh, J. Andrew Bagnell, Wen Sun

TL;DR
This paper analyzes the differences between online reinforcement learning and offline contrastive methods for preference fine-tuning of large language models, emphasizing the role of dataset coverage and proposing a hybrid optimization algorithm.
Contribution
The paper provides a theoretical analysis of the coverage conditions that govern convergence of preference fine-tuning methods, and introduces HyPO, a hybrid approach that combines offline and online data.
Findings
Offline contrastive methods require global (full) coverage to converge to the optimal policy.
Online methods need only partial coverage, explaining their better performance in limited data scenarios.
HyPO outperforms pure offline methods like DPO in experiments.
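To make the offline/online distinction concrete, here is a minimal sketch in plain Python. The `dpo_loss` function follows the standard DPO objective computed on offline preference pairs. The `hybrid_loss` function is only an assumption about what "combining offline and online data" could look like: it adds a KL penalty toward the reference policy, estimated from samples drawn online from the current policy. This page does not state HyPO's exact objective, so the function names, the `lam` weight, and the form of the penalty are all hypothetical.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO loss for one offline pair (w = preferred, l = rejected).
    # Inputs are sequence log-probabilities under the policy and the reference.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def kl_estimate(logp_online, ref_logp_online):
    # Monte Carlo estimate of KL(policy || reference), using responses
    # sampled from the current policy (the "online" data).
    diffs = [p - r for p, r in zip(logp_online, ref_logp_online)]
    return sum(diffs) / len(diffs)

def hybrid_loss(offline_pairs, logp_online, ref_logp_online, beta=0.1, lam=0.01):
    # ASSUMPTION: hypothetical hybrid objective = mean DPO loss on offline
    # pairs + lam * online KL estimate. Not confirmed by the source page.
    offline = sum(dpo_loss(*pair, beta=beta) for pair in offline_pairs) / len(offline_pairs)
    return offline + lam * kl_estimate(logp_online, ref_logp_online)

if __name__ == "__main__":
    pairs = [(-1.0, -2.0, -1.2, -1.8), (-0.5, -0.7, -0.5, -0.7)]
    print(hybrid_loss(pairs, [-1.0, -2.0], [-1.5, -2.5]))
```

The offline term only ever sees responses that appear in the preference dataset, while the penalty term is evaluated on whatever the current policy actually generates. That is the intuition behind the coverage findings above: a purely offline loss has no signal about responses the dataset does not cover.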
Abstract
Learning from human preference data has emerged as the dominant paradigm for fine-tuning large language models (LLMs). The two most common families of techniques -- online reinforcement learning (RL) such as Proximal Policy Optimization (PPO) and offline contrastive methods such as Direct Preference Optimization (DPO) -- were positioned as equivalent in prior work due to the fact that both have to start from the same offline preference dataset. To further expand our theoretical understanding of the similarities and differences between online and offline techniques for preference fine-tuning, we conduct a rigorous analysis through the lens of dataset coverage, a concept that captures how the training data covers the test distribution and is widely used in RL. We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the…
Taxonomy
Topics: Recommender Systems and Techniques · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
Methods: Direct Preference Optimization
