Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Jie Liu; Zhanhui Zhou; Jiaheng Liu; Xingyuan Bu; Chao Yang; Han-Sen Zhong; Wanli Ouyang

arXiv:2406.11817·cs.CL·June 18, 2024·1 cites

TL;DR

This paper introduces iterative length-regularized DPO (iLR-DPO), a method that improves 7B language models to GPT-4 level by balancing response quality and verbosity through length penalization.

Contribution

The paper proposes iLR-DPO, a novel extension of DPO that incorporates length regularization to prevent verbosity and enhance alignment with human preferences.

Findings

01 7B model achieves GPT-4 level performance on benchmarks

02 iLR-DPO improves response quality without increasing verbosity

03 Model outperforms GPT-4 in length-controlled win rate

Abstract

Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard, and OpenLLM Leaderboard. These results demonstrate the effectiveness of…
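The abstract does not spell out the loss, so the following is only a minimal sketch of one common way to add a length penalty to DPO, not necessarily the authors' exact formulation. It assumes the underlying reward is penalized by `alpha` times the response length, which turns into a length-difference margin inside the usual DPO sigmoid. The function name `lr_dpo_loss`, the parameter names, and the default values of `beta` and `alpha` are hypothetical and chosen for illustration.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair length-regularized DPO loss (illustrative sketch).

    logp_* are summed log-probabilities of the chosen (w) and rejected (l)
    responses under the policy; ref_logp_* are the same under the frozen
    reference model; len_* are response lengths in tokens.
    beta and alpha are assumed, illustrative hyperparameters.
    """
    # Standard DPO margin: difference of the implicit rewards.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length term (assumption): a reward penalized by alpha * length
    # adds alpha * (len_w - len_l) to the margin.
    margin += alpha * (len_w - len_l)
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With `alpha=0` this reduces to the plain DPO loss. With `alpha>0`, a pair whose chosen response is longer already has a larger margin, so it contributes less training signal, which is one way to discourage the policy from learning "longer is better". In an iterative setup as described in the abstract, this loss would be applied each round to fresh preference pairs labeled by the reward model.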

Taxonomy

Topics: Numerical Methods and Algorithms · Error Correcting Code Techniques · Reservoir Engineering and Simulation Methods

Methods: Attention Is All You Need · Direct Preference Optimization · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention