Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh

TL;DR
This paper extends DPO algorithms by exploring nonconvex $f$-divergences, introducing the SquaredPO loss that is displacement-resistant and offers improved theoretical guarantees with competitive practical performance.
Contribution
It generalizes the conditions for tractability of RLHF optimization beyond convex $f$-divergences and introduces a new displacement-resistant $f$, leading to the novel SquaredPO loss.
Findings
SquaredPO performs competitively in practice.
Displacement-resistant $f$-divergences prevent probability displacement.
Theoretical guarantees are strengthened with the new loss.
Abstract
DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by a family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to…
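To make the probability displacement mentioned in the abstract concrete, here is a minimal sketch of the standard (KL-based) DPO loss for a single preference pair. It is not the paper's SquaredPO loss, whose form is not given on this page. The function names, the value `beta=0.1`, and the log-probabilities are illustrative assumptions. The sketch shows that the loss depends only on the margin between the winner's and loser's log-ratios, so both probabilities can fall together without the loss changing.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_* are policy log-probabilities and ref_logp_* are reference-policy
    log-probabilities of the winner (w) and loser (l) responses.
    beta is the regularization strength (illustrative value, an assumption).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


# Hypothetical reference log-probabilities for the two responses.
ref_w, ref_l = -2.0, -2.0

# Policy A: winner slightly more likely, loser slightly less likely.
loss_a = dpo_loss(-1.5, -2.5, ref_w, ref_l)

# Policy B: same margin, but both responses are far less likely than under A.
loss_b = dpo_loss(-11.5, -12.5, ref_w, ref_l)

# Loss when the policy equals the reference (margin is zero).
loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)
```

Policies A and B receive the same loss even though B assigns much lower probability to both responses, which is the kind of behavior a displacement-resistant choice of $f$ is meant to rule out.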
Peer Reviews
Decision: ICLR 2026 Poster
The perspective of looking at $f$-DPO is novel, with non-trivial theoretical results, which also yields a new DPO variant that is theoretically and empirically better than vanilla DPO. Many benchmarks are used in the experiments. There seem to be sufficient details to reproduce the experiments.
Figure 4 does not show a significant performance difference between DPO and the proposed SquaredPO. The paper could be better if (1) sufficient conditions for displacement resistance were provided, and (2) $f$-DPO with more choices of $f$, such as $\chi^2$, were compared empirically.
Analysis - The paper provides a thorough analysis of the properties of different objectives and characterizes a wide range of possible objective functions. They also provide direct empirical comparisons of changes in likelihood as well as win rate and benchmark performance. Clarity - The paper provides a clear presentation of ideas with detailed explanations. The theoretical definitions and interpretations walk through the key ideas and the experimental setup is well described and provides sup
Contributions - While the analysis is thorough and claims made are well supported, there is a lack of comparison to other methods that aim to achieve the same goal and it is unclear whether the convexity constraint is an issue. Figure 1 shows that there are multiple convex functions which are DPO-inducing and displacement-resistant, many of which have already been explored and have successfully mitigated over-optimization of displacement. As a result, without further comparison to these existing
- Rigorous theoretical analysis.
- Easy-to-adapt loss (only the regularization term is changed, yet validity is proved).
- The experiments are narrow: the only model is Llama and the only baseline is naive DPO.
- In the experiments, the performance gain is marginal, or performance is merely similar to naive DPO. Specifically, if SquaredPO at epoch 4 achieves the same performance as DPO at epoch 1 (Table 1), why should we use SquaredPO?
- No performance analysis without LoRA.
Taxonomy
Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Natural Language Processing Techniques
