Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult
Cheolhun Jang

TL;DR
This paper introduces MIPO, a method that adaptively modulates the influence of a reference model during preference optimization, improving alignment especially when the reference model is poorly aligned.
Contribution
MIPO dynamically adjusts the intervention level from the reference model based on data alignment, enhancing preference optimization beyond fixed regularization approaches.
Findings
MIPO outperforms DPO across multiple benchmarks.
MIPO adapts the degree of reference-model intervention to how well the given data is already aligned with the reference model.
Experimental results show consistent improvement over DPO.
Abstract
Preference optimization methods typically begin training with a well-trained SFT model as a reference model. In RLHF and DPO, a regularization term is used during the preference optimization process to prevent the policy model from deviating too far from the reference model's distribution, thereby avoiding the generation of anomalous responses. When the reference model is already well-aligned with the given data or only requires slight adjustments, this approach can produce a well-aligned model. However, if the reference model is not aligned with the given data and requires significant deviation from its current state, a regularization term may actually hinder the model alignment. In this study, we propose Modulated Intervention Preference Optimization (MIPO) to address this issue. MIPO modulates the degree of intervention from the reference model based on how well the given…
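To make the role of the reference model concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, which is the fixed-regularization baseline the abstract contrasts MIPO with. It is not MIPO itself: the abstract is truncated before the modulation rule is stated, so that rule is deliberately left out. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be sequence-level log-probabilities already computed by the policy and reference models.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; strength of the pull toward the reference
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    The reference model enters only through ref_margin, which is
    subtracted from the policy margin with a fixed weight of 1.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(z)) == softplus(-z)
    return softplus(-logits)


# At initialization the policy equals the reference, so the margins cancel
# and the loss is log(2) no matter how well the reference fits the pair.
loss_at_init = dpo_loss(-1.0, -2.0, -1.0, -2.0)

# Reference prefers the rejected response (ref_margin = -3): the policy must
# move its margin away from the reference's to reduce the loss.
loss_misaligned = dpo_loss(-4.0, -1.0, -4.0, -1.0)
loss_after_shift = dpo_loss(-2.0, -3.0, -4.0, -1.0)
```

In this baseline the reference term always carries the same weight, whether the reference already ranks the pair correctly or not. That fixed weighting is what the abstract describes as potentially hindering alignment when the reference is poorly matched to the data.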
Taxonomy
Topics: Health Systems, Economic Evaluations, Quality of Life
Methods: Direct Preference Optimization · Supervised Fine-Tuning
