MOSLIM: Align with diverse preferences in prompts through reward classification
Yu Zhang, Wanli Jiang, and Zhengyu Yang

TL;DR
MOSLIM introduces a flexible multi-objective alignment method for LLMs that uses a single reward model and a single policy. It handles diverse preferences without preference-specific training and outperforms existing methods on most benchmarks while using fewer GPU resources.
Contribution
This work presents MOSLIM, a novel approach that uses a multi-head reward model and a single policy to align LLMs with multiple preferences without preference-specific training.
Findings
Outperforms current multi-objective methods on most benchmarks.
Requires fewer GPU resources compared to existing policy optimization techniques.
Effective across various reward model sizes and optimization methods.
Abstract
The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during the SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimizes the policy model with…
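The idea of a multi-head reward model that classifies question-answer pairs, rather than scoring them, can be illustrated with a small sketch. This is not the authors' implementation. The feature vector stands in for an encoded question-answer pair, the head weights are made-up values, and `prompt_conditioned_reward` is a hypothetical mapping from class probabilities to a scalar, added only for illustration. The abstract above is truncated before it says how the policy is actually optimized.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MultiHeadRewardClassifier:
    """Toy multi-head classifier: one linear head per preference dimension.

    ASSUMPTION: a real model would share an LLM backbone; here the shared
    representation is just a plain feature vector passed in by the caller.
    """

    def __init__(self, heads: Dict[str, List[List[float]]]):
        # heads[name] is a (num_classes x feature_dim) weight matrix.
        self.heads = heads

    def classify(self, features: List[float]) -> Dict[str, List[float]]:
        """Return class probabilities for every preference dimension."""
        probs = {}
        for name, weights in self.heads.items():
            logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
            probs[name] = softmax(logits)
        return probs


def prompt_conditioned_reward(
    probs: Dict[str, List[float]], requested: Dict[str, int]
) -> float:
    """HYPOTHETICAL reward: mean probability of the classes requested in the prompt.

    This mapping is an illustrative assumption, not the formula from the paper.
    """
    return sum(probs[name][cls] for name, cls in requested.items()) / len(requested)


# Hypothetical heads: 3 intensity classes for helpfulness, 2 for harmlessness.
model = MultiHeadRewardClassifier({
    "helpfulness": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "harmlessness": [[0.5, -0.5], [-0.5, 0.5]],
})
features = [0.2, 0.9]  # stand-in for an encoded question-answer pair
probs = model.classify(features)
reward = prompt_conditioned_reward(probs, {"helpfulness": 2, "harmlessness": 1})
print(probs, reward)
```

Because each head outputs a distribution over classes, the preference dimensions and intensities named in a prompt can select which classes count toward the reward, so a single reward model can serve several objectives.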
Peer Reviews
Decision·Submitted to ICLR 2025
The paper is well organized and the main concepts are well articulated. The methodology section is clear and sound, with a sufficient summary of previous methods at the start. The differences between the proposed method and the baselines are explicitly demonstrated through the figures. In the experiments section, the outline at the front is followed by corresponding verification of the experimental results.
1. Experiment designs may need improvements.
   - Although the authors mention the latest work RiC and CDPO, there appear to be no direct comparison results in the manuscript. If feasible, it would be more persuasive if the authors could show the superiority of MOSLIM over these two methods through comparisons of performance or training cost.
   - For Figure 5, I am not sure whether it is a fair comparison with two baselines since these are not trained with certain granularity of
- By employing a single reward model for diverse preferences, MOSLIM significantly reduces computational overhead. This enables the use of off-the-shelf models without fine-tuning for each new preference.
- The framework shows potential scalability, given its ability to handle multiple preference objectives without complex adjustments during training.
- While MOSLIM addresses multi-objective alignment, the core approach builds on established methods, particularly prompt-driven alignment and multi-head reward models. The contributions are incremental rather than ground-breaking, as the framework primarily refines and consolidates existing techniques.
- The writing is hard to understand, and the word usage is inconsistent in the paper (e.g., both RSoup and RSopu are used).
- Limited comparison with a few baselines (Rsoup and MORLHF). Please add
- MOSLIM outperforms existing methods such as MORLHF and Rewarded Soups, while achieving controllable alignment across different preference dimensions and intensities.
- The model’s effectiveness is thoroughly validated across several benchmarks (MT-Bench, HaluEval 2.0, Hackaprompt). The study explores various model scales and compares different algorithms (PPO, RLOO, Online-DPO), demonstrating MOSLIM’s robustness across configurations.
- Contribution and Novelty: The approach of using a multi-head RM for preference alignment has been introduced in prior work [1,2], which may also support multi-objective preference classification. Additionally, the fast inference strategy utilized by MOSLIM shares similarities with previous efforts to dynamically adjust preferences, such as Rewards-in-Context [3]. While MOSLIM mentions that some methods use SFT loss primarily to enhance core abilities (lines 55-62), this claim may overlook the
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
