The Crucial Role of Samplers in Online Direct Preference Optimization
Ruizhe Shi, Runlong Zhou, Simon S. Du

TL;DR
This paper analyzes how different sampling strategies affect the convergence of Direct Preference Optimization (DPO) in language model alignment, revealing that tailored samplers significantly improve convergence rates and practical performance.
Contribution
The paper provides a rigorous theoretical analysis of DPO's convergence with various samplers and introduces an improved online sampler that enhances real-world performance.
Findings
Uniform sampling achieves linear convergence.
The proposed online sampler achieves quadratic convergence.
The practical variant of the sampler outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset.
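To make the gap between the two rates concrete, here is a minimal sketch assuming the standard optimization meaning of the terms: linear convergence shrinks the error by a constant factor per step, while quadratic convergence squares it. The constants, starting error, and the helper name `iterations_to_tolerance` are illustrative assumptions, not values or code from the paper.

```python
def iterations_to_tolerance(update, error=0.5, tol=1e-12, max_iters=1000):
    """Count how many updates it takes for the error to drop below tol."""
    count = 0
    while error >= tol and count < max_iters:
        error = update(error)
        count += 1
    return count


# Linear rate (illustrative constant 0.5): e_{t+1} = 0.5 * e_t
linear_steps = iterations_to_tolerance(lambda e: 0.5 * e)

# Quadratic rate (illustrative constant 1): e_{t+1} = e_t ** 2
quadratic_steps = iterations_to_tolerance(lambda e: e * e)

print(linear_steps, quadratic_steps)
```

Starting from the same error, the quadratic recursion needs far fewer iterations because the number of correct digits roughly doubles at every step instead of growing by a fixed amount.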
Abstract
Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but…
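The exact gradient setting with a uniform sampler can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a tabular softmax policy over a finite set of responses, a uniform reference policy (so the reference terms cancel), Bradley-Terry preferences derived from a known reward vector, and a sampler represented as a weight matrix over response pairs. The function names, reward values, and step size are hypothetical. It covers only the uniform baseline, not the proposed online sampler.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_exact_gradient(theta, rewards, beta, weights):
    """Exact gradient of the sampler-weighted DPO loss w.r.t. the logits.

    Assumes a softmax policy with logits `theta` and a uniform reference,
    so the implicit reward margin of a pair (a, b) is
    beta * (theta[a] - theta[b]).
    """
    n = len(theta)
    grad = [0.0] * n
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            margin = beta * (theta[a] - theta[b])
            # Bradley-Terry probability that a is preferred over b.
            target = sigmoid(rewards[a] - rewards[b])
            g = weights[a][b] * beta * (sigmoid(margin) - target)
            grad[a] += g
            grad[b] -= g
    return grad


def run_dpo(rewards, beta=1.0, lr=4.0, steps=200):
    """Gradient descent on the DPO loss with a uniform pair sampler."""
    n = len(rewards)
    theta = [0.0] * n
    # Uniform sampler: every ordered pair is drawn with equal probability.
    weights = [[1.0 / (n * n)] * n for _ in range(n)]
    errors = []
    for _ in range(steps):
        grad = dpo_exact_gradient(theta, rewards, beta, weights)
        theta = [t - lr * g for t, g in zip(theta, grad)]
        # Largest gap between the implicit and the true reward margins.
        errors.append(max(
            abs(beta * (theta[a] - theta[b]) - (rewards[a] - rewards[b]))
            for a in range(n) for b in range(n)
        ))
    return theta, errors


theta, errors = run_dpo([0.0, 1.0, 2.0])
```

In this toy setup the loss is minimized when `beta * (theta[a] - theta[b])` equals `rewards[a] - rewards[b]` for every pair, so `errors` tracks the distance to the optimum. Swapping in a different `weights` matrix is one way to experiment with how the choice of sampler changes the decay of that error.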
Peer Reviews
Decision: ICLR 2025 Poster
1. Theoretical Rigor: The authors provide a comprehensive theoretical analysis of DPO convergence with various samplers, adding clarity to an underexplored aspect of preference optimization.
2. Practical Enhancements: The proposed samplers improve DPO's performance, demonstrating notable advantages over baseline approaches on empirical datasets.
3. Insightful Implications: The work not only proposes new samplers but also reinterprets existing DPO methods within their framework, offering a broa
1. The experiments are not sufficient to validate the method's performance. First, in Table 2, the model is scored by the same reward function used for the training set, so the improvement is likely to come from overfitting. Hence, I suggest that the authors evaluate performance using GPT-4o.
2. The analysis lacks intuition for the specific choice of the mixed sampler, such as why, in Line 226, $\pi^{s1}$ and $\pi^{s2}$ should have the form of $\exp(r)$ and $\exp(-r)$. Is the wa
1. By developing a general framework of mixtures of heterogeneous sampling strategies, the paper can unify some existing methods.
2. The theoretical results show a separation in convergence rates that is quite unexplored in this area.
3. Empirical evaluations seem to align with theoretical results, showing that the analysis in this paper is promising in improving RLHF.
1. The mixed samplers in Definitions 4 and 5 differ from standard samplers in two aspects: first, they consider a heterogeneous sampling scheme (enhancer) that increases the difference between the positive completion and the negative completion; second, they mix the heterogeneous sampling scheme with the standard (homogeneous) sampling scheme using some nontrivial mixing coefficient. In the theoretical study, it is shown that the two aspects combined have certain benefits. However, overall there is a lac
The problem the authors study in this paper is an important and timely one. The setting in the paper (essentially finite-armed bandits) is admittedly very stylized, but I found the theoretical results to be interesting and non-trivial, and I can imagine that they might serve as a useful starting point to study tradeoffs around sampling in online alignment for more complex/challenging settings. I generally found the paper to be well-written and easy to follow.
The main limitations of the paper concern the simple/stylized nature of the bandit setting the authors study.
- The authors restrict their attention to the setting where the response space is small/finite, which allows for uniform sampling, and neglects the problem of *exploration*, which is critical for large response spaces. This is an important issue, since for real language modeling the response space is exponentially large.
- The authors, by focusing on the bandit setting, do not consider
Taxonomy
Topics: Consumer Market Behavior and Pricing
Methods: Direct Preference Optimization
