Coverage Improvement and Fast Convergence of On-policy Preference Learning

Juno Kim; Jihun Yun; Jason D. Lee; Kwang-Sung Jun

arXiv:2601.08421·cs.LG·January 14, 2026

Open Access

TL;DR

This paper provides a theoretical analysis of on-policy preference learning algorithms, demonstrating their rapid convergence due to coverage improvement, and introduces new methods that outperform off-policy approaches in language model alignment.

Contribution

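It introduces the coverage improvement principle, proves that on-policy DPO converges exponentially in the number of iterations once the batch size exceeds a generalized coverage threshold (in the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class), and proposes a hybrid sampling method for faster convergence.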

Findings

01. On-policy DPO converges exponentially in the number of iterations once the batch size exceeds a generalized coverage threshold.

02. Hybrid sampler guarantees convergence in just two rounds.

03. On-policy methods outperform off-policy counterparts in experiments.

Abstract

Online on-policy preference learning algorithms for language model alignment, such as online direct preference optimization (DPO), can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy's coverage evolves throughout on-policy training. We propose and rigorously justify the coverage improvement principle: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and a linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the…
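To make the setting in the abstract concrete, the following is a minimal sketch of an on-policy DPO loop in a single-context bandit with Bradley-Terry feedback and a linear softmax policy. It is an illustration under stated assumptions, not the paper's algorithm or its hybrid sampler: the reference policy is assumed to stay fixed at the initial parameters, each round is fit with plain gradient descent, and every name (`softmax_policy`, `sample_preferences`, `dpo_loss_and_grad`, `on_policy_dpo`) and default value is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax_policy(theta, features):
    """Linear softmax policy: pi(y) is proportional to exp(<theta, features[y]>)."""
    logits = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_preferences(rng, policy, features, theta_star, batch_size):
    """Sample response pairs from `policy` and label each with a Bradley-Terry draw."""
    arms = list(range(len(policy)))
    rewards = [sum(t * f for t, f in zip(theta_star, phi)) for phi in features]
    pairs = []
    for _ in range(batch_size):
        a, b = rng.choices(arms, weights=policy, k=2)
        a_wins = rng.random() < sigmoid(rewards[a] - rewards[b])
        pairs.append((a, b) if a_wins else (b, a))  # stored as (winner, loser)
    return pairs


def dpo_loss_and_grad(theta, theta_ref, features, pairs, beta):
    """Mean of -log sigmoid(beta * <theta - theta_ref, phi_w - phi_l>) and its gradient.

    For a linear softmax policy the log-normalizers cancel inside the DPO
    log-ratio difference, which leaves this logistic-regression form.
    """
    dim = len(theta)
    loss, grad = 0.0, [0.0] * dim
    for w, l in pairs:
        diff = [fw - fl for fw, fl in zip(features[w], features[l])]
        margin = beta * sum((t - r) * d for t, r, d in zip(theta, theta_ref, diff))
        loss += -math.log(sigmoid(margin))
        # d/dm of -log sigmoid(m) is -sigmoid(-m); the chain rule adds beta * diff.
        coeff = -beta * sigmoid(-margin)
        for i in range(dim):
            grad[i] += coeff * diff[i]
    n = len(pairs)
    return loss / n, [g / n for g in grad]


def on_policy_dpo(features, theta_star, rounds=3, batch_size=500,
                  beta=1.0, steps=100, lr=0.5, seed=0):
    """Each round: sample from the current policy, then descend the DPO loss."""
    rng = random.Random(seed)
    theta_ref = [0.0] * len(theta_star)  # assumption: reference fixed at initialization
    theta = list(theta_ref)
    for _ in range(rounds):
        policy = softmax_policy(theta, features)  # on-policy: data follows the current theta
        pairs = sample_preferences(rng, policy, features, theta_star, batch_size)
        for _ in range(steps):
            _, grad = dpo_loss_and_grad(theta, theta_ref, features, pairs, beta)
            theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta
```

For example, `on_policy_dpo([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]], [1.0, 0.5])` runs three rounds on four arms. The only on-policy ingredient is that `softmax_policy(theta, features)` is recomputed before each batch is drawn; replacing it with a fixed sampling distribution would give an offline variant for comparison.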

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms