Length-Controlled Margin-Based Preference Optimization without Reference Model

Gengxu Li; Tingyu Xia; Yi Chang; Yuan Wu

arXiv:2502.14643·cs.CL·May 30, 2025

TL;DR

This paper introduces LMPO, a novel preference optimization method that controls response length and reduces probability degradation in large language models, outperforming existing techniques in benchmark tests.

Contribution

LMPO presents a length-controlled margin-based loss built on a uniform reference model, which removes the need for a separate reference model during training and aims to improve the stability and efficiency of preference optimization.

Findings

01. LMPO effectively controls response length in large language models.
02. It reduces probability degradation compared to existing methods.
03. LMPO outperforms state-of-the-art preference optimization techniques on multiple benchmarks.

Abstract

Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within…
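
To make the contrast in the abstract concrete, the sketch below compares the standard DPO loss for a single preference pair with a reference-free loss on average (length-normalized) log-probabilities. It is an illustration only and not LMPO's actual loss, which the truncated abstract does not spell out. The function names, the `beta` values, and the simplification that the reference terms cancel when the reference assigns both responses the same probability are assumptions made for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy and
    under a frozen reference model, which must be kept in memory.
    """
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def avg_logp(token_logps: list[float]) -> float:
    """Average per-token log-probability (length-normalized)."""
    return sum(token_logps) / len(token_logps)


def reference_free_loss(token_logps_w: list[float],
                        token_logps_l: list[float],
                        beta: float = 2.0) -> float:
    """Hypothetical reference-free loss on average log-probabilities.

    Assumption: the reference assigns both responses equal probability,
    so its terms cancel and no reference model is evaluated.
    """
    margin = beta * (avg_logp(token_logps_w) - avg_logp(token_logps_l))
    return -log_sigmoid(margin)
```

When the policy and the reference agree exactly, the DPO margin is zero and `dpo_loss` equals log 2. In the reference-free variant, a response's score depends on its average token log-probability rather than on the sum, so repeating equally likely tokens does not change the score. This is one way to read the abstract's point about matching training to inference.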

Code & Models

Repositories

gengxuli/lmpo (PyTorch, official)

Taxonomy

Topics: Data Management and Algorithms · Constraint Satisfaction and Optimization

Methods: Direct Preference Optimization