Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Idan Pipano; Shoham Sabach; Kavosh Asadi; Mohammad Ghavamzadeh

arXiv:2602.06788 · cs.LG · February 9, 2026

Open Access 3 Reviews

TL;DR

This paper extends DPO algorithms by exploring nonconvex $f$-divergences, introducing the SquaredPO loss that is displacement-resistant and offers improved theoretical guarantees with competitive practical performance.

Contribution

It generalizes the conditions for tractability of RLHF optimization beyond convex $f$-divergences and introduces a new displacement-resistant $f$, leading to the novel SquaredPO loss.

Findings

01

SquaredPO performs competitively in practice.

02

Displacement-resistant $f$-divergences prevent probability displacement.

03

Theoretical guarantees are strengthened with the new loss.

Abstract

DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach can be generalized further: the original problem remains tractable even if the KL divergence is replaced by any member of a family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to…
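To make the role of the generating function $f$ concrete, here is a minimal per-example sketch of an $f$-DPO-style loss. It assumes the form commonly given in prior $f$-DPO work, in which the implicit reward is $\beta f'(\pi/\pi_{\text{ref}})$; that form is not stated on this page, and the names `f_dpo_loss`, `kl_f_prime`, and the default `beta` are hypothetical. With the KL generator $f(u) = u \log u$ it reduces to the standard DPO loss. The generating function behind SquaredPO is not given in the text above, so it is not implemented here.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def kl_f_prime(u: float) -> float:
    """Derivative of the KL generator f(u) = u*log(u), i.e. log(u) + 1."""
    return math.log(u) + 1.0


def f_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               beta=0.1, f_prime=kl_f_prime):
    """Per-example f-DPO-style loss (assumed form, see lead-in).

    logp_w / logp_l: policy log-probabilities of the winner / loser response.
    ref_logp_w / ref_logp_l: the same quantities under the reference policy.
    f_prime: derivative of the generating function f (hypothetical hook).
    """
    # Density ratios pi / pi_ref for the winner and loser responses.
    u_w = math.exp(logp_w - ref_logp_w)
    u_l = math.exp(logp_l - ref_logp_l)
    # Difference of implicit rewards, fed to the Bradley-Terry likelihood.
    margin = beta * (f_prime(u_w) - f_prime(u_l))
    return -log_sigmoid(margin)
```

Passing a different `f_prime` swaps the regularizer without touching the rest of the loss. Whether a given $f$ keeps the problem tractable or resists displacement is exactly what the paper's two conditions address; this sketch does not check either condition.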

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

The perspective of looking at $f$-DPO is novel, with non-trivial theoretical results, which also yields a new DPO variant that is theoretically and empirically better than vanilla DPO. Many benchmarks are used in the experiments. There seem to be sufficient details to reproduce the experiments.

Weaknesses

Figure 4 does not show a significant performance difference between DPO and the proposed SquaredPO. The paper could be improved if (1) sufficient conditions for displacement resistance were provided, and (2) $f$-DPO with more choices of $f$, such as $\chi^2$, were compared empirically.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

Analysis - The paper provides a thorough analysis of the properties of different objectives and characterizes a wide range of possible objective functions. They also provide direct empirical comparisons of changes in likelihood as well as win rate and benchmark performance.

Clarity - The paper provides a clear presentation of ideas with detailed explanations. The theoretical definitions and interpretations walk through the key ideas and the experimental setup is well described and provides sup

Weaknesses

Contributions - While the analysis is thorough and claims made are well supported, there is a lack of comparison to other methods that aim to achieve the same goal and it is unclear whether the convexity constraint is an issue. Figure 1 shows that there are multiple convex functions which are DPO-inducing and displacement-resistant, many of which have already been explored and have successfully mitigated over-optimization of displacement. As a result, without further comparison to these existing

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Rigorous theoretical analysis.
- Easy-to-adapt loss (only the regularization term is changed, yet validity is proved).

Weaknesses

- The experiments are narrow (the only model is Llama, and the only baseline is naive DPO).
- In the experiments, the performance gain is marginal, or performance is merely similar to naive DPO. Specifically, if SquaredPO at epoch 4 achieves the same performance as DPO at epoch 1 (Table 1), why should we use SquaredPO?
- No performance analysis without LoRA.

Taxonomy

Topics: Machine Learning and Algorithms · Reinforcement Learning in Robotics · Natural Language Processing Techniques