Weighted-Reward Preference Optimization for Implicit Model Fusion

Ziyi Yang; Fanqi Wan; Longguang Zhong; Tianyuan Shi; Xiaojun Quan

arXiv:2412.03187·cs.CL·February 27, 2025

TL;DR

This paper introduces WRPO, an implicit fusion method that combines heterogeneous LLMs through preference optimization, avoids vocabulary alignment and matrix fusion, and outperforms existing fusion methods and fine-tuning baselines.

Contribution

WRPO is a preference optimization approach that fuses diverse LLMs without vocabulary alignment or matrix fusion, and it scales to a variety of source models.

Findings

01

WRPO outperforms existing fusion methods and fine-tuning baselines.

02

Achieves 55.9% win rate against GPT-4-Preview-1106 on AlpacaEval-2.

03

Achieves 46.2% win rate against GPT-4-0314 on Arena-Hard.

Abstract

While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to…
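
The idea in the abstract can be made concrete with a minimal sketch. It assumes a DPO-style implicit reward (beta times the log-probability ratio between the trained policy and a frozen reference), one preferred response from a source LLM and one from the target LLM, and a linear ramp for the fusion coefficient. The function names, the `alpha` schedule, and the way the two preferred rewards are mixed are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * log(pi(y|x) / pi_ref(y|x)).

    Inputs are sequence-level log-probabilities of one response.
    """
    return beta * (logp_policy - logp_ref)


def linear_alpha(step: int, total_steps: int, alpha_max: float) -> float:
    """Hypothetical schedule: ramp the fusion coefficient from 0 to alpha_max.

    At alpha = 0 only the target model's preferred response counts;
    as alpha grows, weight moves to the source model's preferred response.
    """
    if total_steps <= 0:
        return alpha_max
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_max * frac


def weighted_reward_loss(
    r_source_chosen: float,
    r_target_chosen: float,
    r_rejected: float,
    alpha: float,
) -> float:
    """Preference loss on a weighted mix of two preferred-response rewards.

    Assumed form: -log sigmoid(alpha * r_src + (1 - alpha) * r_tgt - r_rej).
    """
    chosen = alpha * r_source_chosen + (1.0 - alpha) * r_target_chosen
    return -log_sigmoid(chosen - r_rejected)


# Illustrative numbers only (made-up log-probabilities).
beta = 0.1
r_src = implicit_reward(logp_policy=-40.0, logp_ref=-55.0, beta=beta)
r_tgt = implicit_reward(logp_policy=-30.0, logp_ref=-32.0, beta=beta)
r_rej = implicit_reward(logp_policy=-35.0, logp_ref=-34.0, beta=beta)

for step in (0, 50, 100):
    alpha = linear_alpha(step, total_steps=100, alpha_max=0.1)
    loss = weighted_reward_loss(r_src, r_tgt, r_rej, alpha)
    print(f"step={step:3d} alpha={alpha:.3f} loss={loss:.4f}")
```

Starting with `alpha` at zero means early updates rely only on responses the target model itself can already produce, which is one plausible way to soften the distribution gap the abstract mentions before source-model responses gain weight.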

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. LLM alignment is a popular and valuable research topic, especially aligning heterogeneous models into a target one.
2. The proposed weighted reward is a straightforward way to achieve an implicit model fusion process.
3. Comprehensive evaluation with rich analysis supports the paper's claims.

Weaknesses

1. An analysis of alignment efficiency, together with an ablation, would be helpful. What difference does the choice of source models make? If multiple source models are used, does the alignment cost increase?
2. Some analysis sections, such as those around Figures 3 and 4, could be more informative. Adding more analysis and making these parts denser would further strengthen the draft.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. WRPO effectively introduces implicit model fusion, bypassing the need for complex vocabulary alignment and distribution merging, a significant advancement over traditional methods.
2. The novel reward mechanism balances contributions from source and target models and helps mitigate distributional discrepancies, leading to a smoother optimization process.

Weaknesses

1. The success of WRPO appears to depend heavily on the choice and quality of source models, yet the paper does not fully address criteria or strategies for selecting these source models, which could be a limiting factor in practice.
2. The need for dynamic tuning of the fusion coefficient introduces complexity, and the paper does not sufficiently detail how this parameter was optimized across different datasets and tasks.
3. While WRPO is more efficient than traditional methods, it still requir

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper explores improving the capabilities of individual models by combining the strengths of multiple LLMs, a direction with clear potential benefits.
2. The authors propose a novel implicit fusion method that eliminates the need for vocabulary alignment and matrix fusion.
3. Extensive experiments demonstrate the effectiveness of the proposed method in multiple respects.

Weaknesses

1. Although the proposed method shows some improvement on AlpacaEval-2 and Arena-Hard, it has only a weak effect on MT-Bench. This weakens the case for the method's generalization.
2. The objective of the proposed WRPO is to increase the likelihood of a preferred response while decreasing the likelihood of the dispreferred response. Preferred responses come from source and target models and dispreferred responses only come from the source model, which means dispreferred

Code & Models

Videos

Weighted-Reward Preference Optimization for Implicit Model Fusion · slideslive

Taxonomy

Topics: Industrial Technology and Control Systems · Vehicle emissions and performance · Quality Function Deployment in Product Design