Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

Rui Wang; Qianguo Sun; Chao Song; Junlong Wu; Tianrong Chen; Zhiyun Zeng; Yu Li

arXiv:2508.14947·cs.LG·August 26, 2025

4 Reviews

TL;DR

This paper introduces Linear Preference Optimization (LPO), a new alignment framework that enhances stability and control in preference optimization tasks by decoupling gradients and implementing rejection suppression.

Contribution

LPO offers a novel gradient decoupling method, stability improvements with offset constraints, and controllable rejection suppression, advancing preference alignment techniques.

Findings

01. LPO improves performance across text, math, and TTS tasks.

02. LPO demonstrates robustness and tunability in preference alignment.

03. Extensive experiments validate the effectiveness of LPO.

Abstract

DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that…
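To make the first point of the abstract concrete, the sketch below compares how the gradient behaves under a log-sigmoid loss and under an absolute difference loss, both written as functions of a single scalar margin (the chosen-minus-rejected log-ratio difference). It is an illustration of the general idea only, not the paper's actual objective: the function names, the `beta` value, and the `target` offset are assumptions introduced here, and the positive regularization term and rejection-suppression coefficient mentioned in the abstract are not modeled.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO form: -log sigmoid(beta * margin)."""
    return -math.log(sigmoid(beta * margin))


def dpo_grad(margin: float, beta: float = 0.1) -> float:
    """d/d(margin) of the DPO loss: -beta * sigmoid(-beta * margin).

    The magnitude depends on the margin itself, so the chosen and
    rejected terms are scaled by the same margin-dependent factor.
    """
    return -beta * sigmoid(-beta * margin)


def abs_loss(margin: float, target: float = 1.0) -> float:
    """Absolute difference to a hypothetical target offset (assumption)."""
    return abs(margin - target)


def abs_grad(margin: float, target: float = 1.0) -> int:
    """Subgradient of |margin - target|: -1, 0, or +1.

    The magnitude is constant away from the target, so it does not
    shrink or grow with the current margin.
    """
    d = margin - target
    return (d > 0) - (d < 0)


if __name__ == "__main__":
    for m in (-5.0, 0.0, 5.0, 50.0):
        print(f"margin={m:6.1f}  dpo_grad={dpo_grad(m):+.6f}  abs_grad={abs_grad(m):+d}")
```

Under these assumptions, the log-sigmoid gradient is scaled by `sigmoid(-beta * margin)` and therefore varies with the margin, whereas the absolute-difference gradient has constant magnitude on either side of the target. This is the sense in which a linear loss removes the margin-dependent coupling; how LPO then treats the chosen and rejected terms separately is described in the paper itself.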

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 5

Strengths

The experiments cover many different tasks, including reasoning, alignment, and speech recognition.

Weaknesses

I'm concerned about the novelty of this work. 1. This work mainly changes the DPO loss function, but the change is mostly a direct combination of existing designs from IPO, SimPO, and DPOP. There is no new idea in the final loss functions. Also, I'm not sure if STE really helps. After changing the nonlinear Softmax to a linear absolute value function, the gradient seems to be totally equivalent to the case without the STE. 2. For all the combination from the designs of ~4 existing works, th

Reviewer 02·Rating 0·Confidence 4

Strengths

S1. The limitations of DPO are well characterized mathematically, providing a much needed theoretical background to the method. S2. Given the well defined limitations, the authors proposed sound and targeted fixes to the objective.

Weaknesses

W1. The experimental part is missing critical baselines to compare the contributions of the proposed additions or modifications in the loss. For instance, margin-preserving or offset oriented (SimPO, ODPO), DPOP, Identity PO. For TTS and ASR sections, DPO as baseline would be needed at least. W2. The described strategy for construction of the preference pair dataset is not a novel idea but it describes the method proposed by SPIN. Proper attribution should be given and the writing should be cor

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The paper addresses a well-known issue in DPO, i.e., the over-suppression of rejected responses and the resulting instability in preference alignment. 2. The proposed modification is simple and easy to implement.

Weaknesses

I have the following concerns. *If the authors could properly address them during the rebuttal phase, I am willing to raise my score.* 1. The technical novelty of LPO is limited. Most design choices, such as linearizing the DPO objective, adding offsets, and detaching gradients, appear incremental and heuristic. This paper lacks theoretical justification or principled analysis to explain why these modifications improve alignment performance. 2. Comparisons are mostly restricted to SFT and vanil

Reviewer 04·Rating 2·Confidence 3

Strengths

In this paper, I particularly enjoyed the abstract, introduction, and methods section. In my opinion, the problem is well motivated, and so is the methodology that the authors introduce. Moreover, I found the methods well-explained and easy to follow.

Weaknesses

While I found the method section well-motivated, I believe that this paper currently lacks evidence to support the claims made in the introduction and the challenge in general, as displayed in the experiments. Overall, I think the authors can improve the paper on the following things, and I look forward to discussing this with them: - **No statistical significance reported in experiments** My primary concern with all the provided experiments is the lack of confidence intervals, standard errors
