Coverage Improvement and Fast Convergence of On-policy Preference Learning
Juno Kim, Jihun Yun, Jason D. Lee, Kwang-Sung Jun

TL;DR
This paper gives a theoretical analysis of on-policy preference learning algorithms, attributes their rapid convergence to coverage improvement, and introduces new methods that outperform off-policy approaches in language model alignment.
Contribution
It introduces the coverage improvement principle, proves exponential convergence of on-policy DPO under certain conditions, and proposes a hybrid sampling method for faster convergence.
Findings
On-policy DPO converges exponentially with sufficient batch size.
Hybrid sampler guarantees convergence in just two rounds.
On-policy methods outperform off-policy counterparts in experiments.
Abstract
Online on-policy preference learning algorithms for language model alignment, such as online direct preference optimization (DPO), can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves throughout on-policy training. We propose and rigorously justify the "coverage improvement principle": with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the…
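The setting named in the abstract (Bradley-Terry preferences, linear softmax policies, on-policy DPO) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions and not the authors' implementation: it uses a single context, a hypothetical feature matrix `phi`, a hypothetical true reward parameter `theta_star`, and plain gradient descent on each fresh batch. The function name `on_policy_dpo` and all hyperparameters (`beta`, `rounds`, `batch`, `gd_steps`, `lr`) are made up for the example. For a linear softmax policy the log-partition terms cancel, so the DPO margin for a pair reduces to `beta * (phi[winner] - phi[loser]) @ (theta - theta_ref)`.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def on_policy_dpo(phi, theta_ref, theta_star, beta=1.0, rounds=5,
                  batch=512, gd_steps=200, lr=0.5, seed=0):
    """Sketch of on-policy DPO for a single-context bandit.

    phi        : (K, d) action features (hypothetical, for illustration)
    theta_ref  : (d,) parameters of the reference policy
    theta_star : (d,) true reward parameters used only to simulate labels
    """
    rng = np.random.default_rng(seed)
    K, _ = phi.shape
    theta = theta_ref.copy()
    for _ in range(rounds):
        # On-policy step: both responses are drawn from the CURRENT policy.
        pi = softmax(phi @ theta)
        a1 = rng.choice(K, size=batch, p=pi)
        a2 = rng.choice(K, size=batch, p=pi)
        # Bradley-Terry labels: P(a1 beats a2) = sigmoid(r(a1) - r(a2)).
        p_win = sigmoid((phi[a1] - phi[a2]) @ theta_star)
        first_wins = rng.random(batch) < p_win
        winner = np.where(first_wins, a1, a2)
        loser = np.where(first_wins, a2, a1)
        diff = phi[winner] - phi[loser]  # shape (batch, d)
        # Gradient descent on the DPO loss -log sigmoid(margin),
        # using only this round's batch (an assumption of the sketch).
        for _ in range(gd_steps):
            margin = beta * diff @ (theta - theta_ref)
            grad = -beta * ((1.0 - sigmoid(margin))[:, None] * diff).mean(axis=0)
            theta = theta - lr * grad
    return theta


# Small demo with made-up features and reward parameters.
demo_rng = np.random.default_rng(1)
phi = demo_rng.normal(size=(6, 3))
theta_ref = np.zeros(3)
theta_star = np.array([1.0, -0.5, 0.25])
theta_hat = on_policy_dpo(phi, theta_ref, theta_star)
pi_hat = softmax(phi @ theta_hat)
```

In this toy model the population minimizer of the DPO loss satisfies `beta * (theta - theta_ref) = theta_star`, so comparing `theta_hat` with `theta_ref + theta_star / beta` across rounds gives a rough way to watch the behavior the abstract describes. The sketch makes no attempt to reproduce the paper's batch-size threshold or convergence rates.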
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
