Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
Skyler Wu, Aymen Echarghaoui

TL;DR
This paper proposes four new statistically sound weighting strategies for multiple-reference preference optimization in fine-tuning LLMs, demonstrating improved performance over existing methods and revealing that single-reference DPO often outperforms multi-reference approaches.
Contribution
Introduces four novel weighting strategies for MRPO (two offline, two online) and evaluates their effectiveness on multiple LLM benchmarks.
Findings
All four strategies outperform current MRPO weighting methods.
Single-reference DPO often outperforms multi-reference approaches, which do not consistently improve over it.
Abstract
Fine-tuning is integral for aligning large language models (LLMs) with human preferences. Multiple-Reference Preference Optimization (MRPO) builds on Direct Preference Optimization (DPO) by fine-tuning LLMs on preference datasets while regularizing the policy towards a mixture of reference models to leverage their collective desirable properties. However, current methods for setting the reference weights are ad-hoc and statistically unsound, leading to unreliable performance. To address this, we introduce four new weighting strategies: two offline methods that leverage held-out validation signal; one online method that uses a sliding-window estimator to reduce overfitting; and an online method that treats reference weighting as a K-armed bandit via Thompson Sampling. Experiments using Qwen2.5-0.5B as the policy model and seven reference models from the Llama, Mistral, Qwen, Yi, and…
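To make the bandit framing in the abstract concrete, here is a minimal sketch of Beta-Bernoulli Thompson Sampling over K reference models. It is not the authors' implementation: the class name `ThompsonReferenceSelector`, the one-reference-per-round (one-hot weight) choice, the reward in [0, 1], and the simulated success rates are all assumptions made for illustration, since the summary does not specify how the paper defines the reward or maps arms to weights.

```python
import random


class ThompsonReferenceSelector:
    """Illustrative Beta-Bernoulli Thompson Sampling over K reference models.

    Hypothetical helper: each arm is one reference model, and a reward in
    [0, 1] is assumed to be available after each training step.
    """

    def __init__(self, num_refs, seed=0):
        # Beta(1, 1) prior (uniform) on each arm's success probability.
        self.alpha = [1.0] * num_refs
        self.beta = [1.0] * num_refs
        self.rng = random.Random(seed)

    def select(self):
        # Draw one sample per arm from its posterior and pick the largest.
        samples = [
            self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Conjugate update; reward is assumed to lie in [0, 1].
        self.alpha[arm] += reward
        self.beta[arm] += 1.0 - reward

    def posterior_means(self):
        return [a / (a + b) for a, b in zip(self.alpha, self.beta)]


def simulate(rates, rounds, seed=0):
    """Run the selector against assumed (made-up) per-arm success rates."""
    selector = ThompsonReferenceSelector(len(rates), seed=seed)
    env = random.Random(seed + 1)
    counts = [0] * len(rates)
    for _ in range(rounds):
        arm = selector.select()
        reward = 1.0 if env.random() < rates[arm] else 0.0
        selector.update(arm, reward)
        counts[arm] += 1
    return selector, counts
```

In a training loop, `select()` would be called before each step to choose which reference regularizes the policy, and `update()` afterwards with whatever validation signal is used as the reward. Sampling from the posterior, rather than always taking the arm with the best mean, is what lets the procedure keep exploring references whose usefulness is still uncertain.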
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Multi-Objective Optimization Algorithms · Recommender Systems and Techniques
