TL;DR
This paper introduces Linear Preference Optimization (LPO), a new alignment framework that enhances stability and control in preference optimization tasks by decoupling gradients and implementing rejection suppression.
Contribution
LPO offers a novel gradient decoupling method, stability improvements with offset constraints, and controllable rejection suppression, advancing preference alignment techniques.
Findings
LPO improves performance across text, math, and TTS tasks.
LPO demonstrates robustness and tunability in preference alignment.
Extensive experiments validate the effectiveness of LPO.
Abstract
DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that…
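The first innovation in the abstract, swapping the log-sigmoid for an absolute difference loss, can be illustrated with a minimal sketch. This is not the paper's exact objective, which is not given here. It assumes a scalar `margin` equal to the chosen log-ratio minus the rejected log-ratio, and the names `beta`, `offset`, `dpo_grad`, and `abs_grad` are hypothetical, chosen only for this illustration. The sketch compares how the gradient with respect to the margin behaves under each loss.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO form on a scalar margin: -log(sigmoid(beta * margin))."""
    return -math.log(sigmoid(beta * margin))


def dpo_grad(margin: float, beta: float = 0.1) -> float:
    """d(loss)/d(margin) = -beta * sigmoid(-beta * margin).

    Always negative, so gradient descent keeps enlarging the margin.
    """
    return -beta * sigmoid(-beta * margin)


def abs_loss(margin: float, offset: float = 1.0) -> float:
    """Assumed absolute-difference loss: |margin - offset|.

    `offset` is a hypothetical target margin, not a value from the paper.
    """
    return abs(margin - offset)


def abs_grad(margin: float, offset: float = 1.0) -> float:
    """Subgradient of |margin - offset|: constant magnitude, sign flips at the offset."""
    d = margin - offset
    return 0.0 if d == 0 else math.copysign(1.0, d)


if __name__ == "__main__":
    for m in (-2.0, 0.0, 1.0, 5.0, 20.0):
        print(f"margin={m:6.1f}  dpo_grad={dpo_grad(m):+.4f}  abs_grad={abs_grad(m):+.1f}")
```

Under these assumptions, the log-sigmoid gradient never changes sign, so the margin is pushed upward without bound, while the absolute-difference gradient has constant magnitude and reverses once the margin passes the offset. That is one plausible reading of why a linear loss with an offset constraint could limit overfitting. The positive regularization term and the rejection-suppression coefficient described in the abstract are not modeled here.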
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
The experiments cover many different tasks, including reasoning, alignment, and speech recognition.
I'm concerned about the novelty of this work. 1. This work mainly changes the DPO loss function, but the change is mostly a direct combination of existing designs from IPO, SimPO and DPOP. There is no new idea in the final loss functions. Also, I'm not sure whether the STE really helps. After changing the nonlinear Softmax to a linear absolute-value function, the gradient seems to be totally equivalent to the case without the STE. 2. For all the combination from the designs of ~4 existing works, th
S1. The limitations of DPO are well characterized mathematically, providing a much-needed theoretical background to the method. S2. Given the well-defined limitations, the authors propose sound and targeted fixes to the objective.
W1. The experimental part is missing critical baselines needed to compare the contributions of the proposed additions or modifications to the loss. For instance, margin-preserving or offset-oriented methods (SimPO, ODPO), DPOP, and Identity PO. For the TTS and ASR sections, at least DPO would be needed as a baseline. W2. The described strategy for constructing the preference pair dataset is not novel; it describes the method proposed by SPIN. Proper attribution should be given and the writing should be cor
1. The paper addresses a well-known issue in DPO, i.e., the over-suppression of rejected responses and the resulting instability in preference alignment. 2. The proposed modification is simple and easy to implement.
I have the following concerns. *If the authors could properly address them during the rebuttal phase, I am willing to raise my score.* 1. The technical novelty of LPO is limited. Most design choices, such as linearizing the DPO objective, adding offsets, and detaching gradients, appear incremental and heuristic. This paper lacks theoretical justification or principled analysis to explain why these modifications improve alignment performance. 2. Comparisons are mostly restricted to SFT and vanil
In this paper, I particularly enjoyed the abstract, introduction, and methods section. In my opinion, the problem is well motivated, and so is the methodology that the authors introduce. Moreover, I found the methods well-explained and easy to follow.
While I found the method section well-motivated, I believe that this paper currently lacks evidence to support the claims made in the introduction and the challenge in general, as displayed in the experiments. Overall, I think the authors can improve the paper on the following things, and I look forward to discussing this with them: - **No statistical significance reported in experiments** My primary concern with all the provided experiments is the lack of confidence intervals, standard errors
