Length-Controlled Margin-Based Preference Optimization without Reference Model
Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu

TL;DR
This paper introduces LMPO, a novel preference optimization method that controls response length and reduces probability degradation in large language models, outperforming existing techniques in benchmark tests.
Contribution
LMPO combines a length-controlled margin-based loss with a uniform reference model that stands in for a learned one, improving the stability and efficiency of preference optimization without requiring a separate reference model.
Findings
LMPO effectively controls response length in large language models.
It reduces probability degradation compared to existing methods.
LMPO outperforms state-of-the-art preference optimization techniques on multiple benchmarks.
Abstract
Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within…
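The abstract names two general ingredients: an average log-probability reward and a margin-based preference loss with no reference model. The sketch below illustrates only those generic ideas in plain Python. It is not the LMPO loss, whose exact form is not given in the text above. The function names, the `beta` scale, and the fixed `margin` value are all assumptions made for illustration.

```python
import math


def avg_logprob(token_logprobs):
    """Average per-token log-probability of one response (length-normalized)."""
    return sum(token_logprobs) / len(token_logprobs)


def margin_preference_loss(chosen_logprobs, rejected_logprobs, beta=2.0, margin=0.5):
    """Reference-free logistic preference loss with a target margin.

    Hypothetical illustration, not the paper's loss: the reward of each
    response is beta times its average token log-probability, and the loss is
    -log(sigmoid(reward_chosen - reward_rejected - margin)).
    """
    reward_gap = beta * (avg_logprob(chosen_logprobs) - avg_logprob(rejected_logprobs))
    z = reward_gap - margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Same per-token quality, different lengths: averaging removes the length effect.
    short = [-1.0, -1.0]
    long = [-1.0, -1.0, -1.0, -1.0]
    print(margin_preference_loss(short, long, margin=0.0))
```

Because the reward is averaged over tokens, a longer response does not gain or lose reward from length alone, which is the property that length-normalized objectives aim for. Summed log-probabilities, by contrast, would penalize the longer response in this example.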
Taxonomy
Topics: Data Management and Algorithms · Constraint Satisfaction and Optimization
Methods: Direct Preference Optimization
