Weighted-Reward Preference Optimization for Implicit Model Fusion
Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan

TL;DR
This paper introduces WRPO, a novel implicit fusion method for combining heterogeneous LLMs that avoids complex alignment procedures and demonstrates superior performance on multiple benchmarks.
Contribution
WRPO is a new preference optimization approach that fuses diverse LLMs without vocabulary or matrix alignment, and that scales and adapts to various models.
Findings
WRPO outperforms existing fusion methods and fine-tuning baselines.
Achieves a 55.9% length-controlled win rate against GPT-4-Preview-1106 on AlpacaEval-2.
Achieves a 46.2% win rate against GPT-4-0314 on Arena-Hard.
Abstract
While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to…
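The abstract does not give the objective itself, so the following is only a minimal sketch of what a weighted-reward preference loss with progressive adaptation could look like. It assumes a DPO-style implicit reward (a scaled log-probability ratio against a reference model), a fusion coefficient `alpha` that ramps linearly from 0 to a cap `alpha_max`, and that the weight shifts toward the source LLMs' preferred responses. All function names, parameters, and the linear schedule are hypothetical illustrations, not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Assumed DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def fusion_coefficient(step: int, total_steps: int, alpha_max: float) -> float:
    """Assumed linear ramp of the fusion coefficient from 0 to alpha_max."""
    if total_steps <= 0:
        return alpha_max
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * frac


def weighted_reward_loss(
    r_source_chosen: float,
    r_target_chosen: float,
    r_rejected: float,
    alpha: float,
) -> float:
    """Preference loss whose preferred-side reward mixes two responses.

    alpha = 0 relies only on the target model's preferred response;
    larger alpha puts more weight on the source models' preferred response.
    """
    chosen = alpha * r_source_chosen + (1.0 - alpha) * r_target_chosen
    return -log_sigmoid(chosen - r_rejected)


if __name__ == "__main__":
    beta = 0.1  # hypothetical reward scale
    r_src = implicit_reward(-40.0, -45.0, beta)   # source-preferred response
    r_tgt = implicit_reward(-30.0, -32.0, beta)   # target-preferred response
    r_rej = implicit_reward(-35.0, -33.0, beta)   # dispreferred response
    for step in (0, 50, 100):
        alpha = fusion_coefficient(step, total_steps=100, alpha_max=0.1)
        loss = weighted_reward_loss(r_src, r_tgt, r_rej, alpha)
        print(f"step={step} alpha={alpha:.3f} loss={loss:.4f}")
```

With `alpha = 0` the loss reduces to an ordinary pairwise preference loss on the target model's own response pair, which is why a gradual ramp is one plausible way to soften the distribution gap between source and target responses early in training.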
Peer Reviews
Decision: ICLR 2025 Poster
1. LLM alignment is a popular and valuable research topic, especially aligning heterogeneous models into a single target model. 2. The proposed weighted reward is a straightforward way to achieve implicit model fusion. 3. Comprehensive evaluation with rich analysis supports the paper's claims.
1. An analysis of alignment efficiency and an ablation would be helpful: what difference does the choice of source models make? If multiple source models are used, does the alignment cost increase? 2. Some analyses, such as Figures 3 and 4, could be more informative. Adding more analysis and making these parts denser would further strengthen the draft.
1. WRPO effectively introduces implicit model fusion, bypassing the need for complex vocabulary alignment and distribution merging, a significant advancement over traditional methods. 2. The novel reward mechanism balances contributions from source and target models and helps mitigate distributional discrepancies, leading to a smoother optimization process.
1. The success of WRPO appears to depend heavily on the choice and quality of source models, yet the paper does not fully address criteria or strategies for selecting these source models, which could be a limiting factor in practice. 2. The need for dynamic tuning of the fusion coefficient introduces complexity, and the paper does not sufficiently detail how this parameter was optimized across different datasets and tasks. 3. While WRPO is more efficient than traditional methods, it still requir
1. The paper explores improving the capabilities of individual models by combining the strengths of multiple LLMs. 2. It proposes a novel implicit fusion method that eliminates the need for vocabulary alignment and matrix fusion. 3. Extensive experiments demonstrate the effectiveness of the proposed method in multiple aspects.
1. Although the proposed method shows some improvement on AlpacaEval-2 and Arena-Hard, it has only a weak effect on MT-Bench, which weakens the case for its generalization. 2. The objective of WRPO is to increase the likelihood of a preferred response while decreasing the likelihood of the dispreferred response. Preferred responses come from the source and target models and dispreferred responses only come from the source model, which means dispreferred
Code & Models
- 🤗 FuseAI/FuseChat-Qwen-2.5-7B-SFT · model · 7 dl · ♡ 2
- 🤗 FuseAI/FuseChat-Gemma-2-9B-SFT · model · 4 dl · ♡ 4
- 🤗 FuseAI/FuseChat-Qwen-2.5-7B-Instruct · model · 9 dl · ♡ 15
- 🤗 FuseAI/FuseChat-Gemma-2-9B-Instruct · model · 9 dl · ♡ 7
- 🤗 FuseAI/FuseChat-Llama-3.1-8B-Instruct · model · 70 dl · ♡ 12
- 🤗 FuseAI/FuseChat-Llama-3.1-8B-SFT · model · 15 dl · ♡ 2
- 🤗 FuseAI/FuseChat-Llama-3.2-3B-SFT · model · 7 dl · ♡ 3
- 🤗 FuseAI/FuseChat-Llama-3.2-1B-Instruct · model · 1.0k dl · ♡ 6
- 🤗 FuseAI/FuseChat-Llama-3.2-1B-SFT · model · 3 dl
- 🤗 FuseAI/FuseChat-Llama-3.2-3B-Instruct · model · 55 dl · ♡ 7
Taxonomy
Topics: Industrial Technology and Control Systems · Vehicle emissions and performance · Quality Function Deployment in Product Design
