Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult

Cheolhun Jang

arXiv:2409.17545 · cs.CL · September 30, 2024

TL;DR

This paper introduces MIPO, a method that adaptively modulates the influence of a reference model during preference optimization, improving alignment especially when the reference model is poorly aligned.

Contribution

MIPO dynamically adjusts how strongly the reference model intervenes, based on how well the reference model is aligned with the given data, rather than applying a fixed regularization strength as in DPO.

Findings

1. MIPO outperforms DPO across multiple benchmarks.

2. MIPO adapts intervention based on data alignment.

3. Experimental results show consistent improvement.

Abstract

Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model's distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose Modulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given…
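
To make the role of the reference model concrete, here is a minimal sketch of the standard per-example DPO loss that the abstract uses as its baseline. It is not the MIPO objective, whose form is not given in the text above. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration; the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under the frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair (illustrative sketch)."""
    # Log-ratios measure how far the policy has moved from the reference.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected

    # beta scales the implicit reward; in DPO it is a fixed constant.
    margin = beta * (chosen_ratio - rejected_ratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy assigns the same log-probabilities as the reference, the margin is zero and the loss equals log 2. Because the reference log-probabilities enter every pair with the same fixed weight, the reference model exerts the same kind of pull whether or not it already agrees with the preference data, which is the limitation the abstract describes.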

Taxonomy

Topics: Health Systems, Economic Evaluations, Quality of Life

Methods: Direct Preference Optimization · Shrink and Fine-Tune