Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, Yu Qiao

TL;DR
MODPO is a stable, efficient, RL-free method for multi-objective language model alignment. It matches or outperforms multi-objective RLHF on safety alignment and long-form question answering while reducing computational cost.
Contribution
It extends Direct Preference Optimization to handle multiple objectives without reinforcement learning, enabling more stable and resource-efficient multi-preference alignment.
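As a rough illustration of what an RL-free preference objective looks like, the sketch below computes a DPO-style pairwise loss from sequence log-probabilities. The function names, the `beta` default, and the `margin` argument are assumptions made for this example. In particular, `margin` is a hypothetical placeholder for how reward differences from other objectives might enter the loss; this summary does not state the exact MODPO objective, so consult the paper for it.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_style_loss(
    logp_w: float,      # policy log-prob of the preferred response
    logp_l: float,      # policy log-prob of the dispreferred response
    ref_logp_w: float,  # reference-model log-prob of the preferred response
    ref_logp_l: float,  # reference-model log-prob of the dispreferred response
    beta: float = 0.1,  # assumed strength of the implicit KL regularization
    margin: float = 0.0,  # hypothetical margin from other objectives
) -> float:
    """Pairwise preference loss on the implicit reward beta * log(pi / pi_ref)."""
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(implicit_w - implicit_l - margin)
```

With `margin=0.0` this reduces to the standard DPO loss for a single preference pair. No sampling from the policy and no separate value model are needed, which is the sense in which such objectives avoid reinforcement learning.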
Findings
MODPO matches or exceeds existing methods in safety and QA tasks.
It produces a Pareto front of models for diverse preferences.
It requires roughly one-third of the computational resources of MORLHF.
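To make the Pareto-front finding concrete, the sketch below filters a set of models down to the non-dominated ones, given one score per objective where higher is better. The helper names and the scores in the usage note are hypothetical and are not results from the paper.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, with made-up (helpfulness, safety) scores `[(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]`, the last point is dominated by `(0.5, 0.5)` and is dropped, while the other three trade one objective against the other and all remain on the front.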
Abstract
A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences. Recent approaches therefore prefer customization, gathering multi-dimensional feedback, and creating distinct reward models for each dimension. Different language models are then optimized for various preferences using multi-objective RLHF (MORLHF) with varying reward weights. However, RL fine-tuning is unstable and resource-heavy, especially with diverse and usually conflicting objectives. In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free extension of Direct Preference Optimization (DPO) for multiple alignment objectives. Essentially, MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models that combine all objectives with…
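The idea of combining per-dimension reward models "with varying reward weights" can be sketched as a weighted sum of rewards, with one model trained per weight vector. The linear form, the function names, and the two-objective weight grid below are assumptions for illustration; the abstract as shown here does not specify how the weights are applied.

```python
from typing import List, Sequence, Tuple


def collective_reward(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of per-objective rewards (assumed linear scalarization)."""
    if len(rewards) != len(weights):
        raise ValueError("rewards and weights must have the same length")
    return sum(w * r for w, r in zip(weights, rewards))


def weight_grid(steps: int) -> List[Tuple[float, float]]:
    """Evenly spaced two-objective weight vectors that each sum to 1."""
    return [(i / steps, 1 - i / steps) for i in range(steps + 1)]


# Each weight vector would define one training run and hence one model;
# the reward values here are placeholders, not outputs of a real reward model.
for w in weight_grid(4):
    print(w, collective_reward([1.0, 3.0], w))
```

Sweeping the weights in this way is what yields a family of models, each reflecting a different trade-off between the objectives.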
Taxonomy
Topics: Software Engineering Research · Speech and dialogue systems · Software Engineering Techniques and Practices
Methods: OPT
