What Matters for Model Merging at Scale?
Prateek Yadav, Tu Vu, Jonathan Lai, Alexandra Chronopoulou, Manaal Faruqui, Mohit Bansal, Tsendsuren Munkhdalai

TL;DR
This paper systematically evaluates large-scale model merging, revealing that merging effectiveness improves with larger models and stronger base models, and that merging enhances generalization, often surpassing multitask training.
Contribution
The paper provides a comprehensive analysis of model merging at scale, a previously underexplored setting, examining how model size, base model quality, and merging method influence performance.
Findings
Merging is more effective with strong base models.
Larger models facilitate easier merging.
Merging improves generalization, sometimes outperforming multitask training.
Abstract
Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors, such as base model quality and the number of expert models, to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods (Averaging, Task Arithmetic, DARE, and TIES) across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We…
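To make two of the merging methods named in the abstract concrete, here is a minimal sketch of Averaging and Task Arithmetic on toy parameters. It is an illustration only, not the paper's implementation: the function names, the dict-of-lists parameter layout, and the `scale` argument are assumptions made for this example, and DARE and TIES (which additionally drop or sign-resolve parameters before merging) are not shown.

```python
# Toy sketch: a "model" is a dict mapping a parameter name to a flat list of floats.
# All names and the data layout here are hypothetical, chosen for illustration.

def average_merge(experts):
    """Averaging: element-wise mean of the expert parameters."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }

def task_arithmetic_merge(base, experts, scale=1.0):
    """Task Arithmetic: add the scaled sum of task vectors to the base model.

    A task vector is (expert - base); `scale` is an assumed tunable coefficient.
    """
    merged = {}
    for name, base_vals in base.items():
        summed = [
            sum(e[name][i] - base_vals[i] for e in experts)
            for i in range(len(base_vals))
        ]
        merged[name] = [b + scale * s for b, s in zip(base_vals, summed)]
    return merged

base = {"w": [1.0, 2.0]}
experts = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]

print(average_merge(experts))                           # {'w': [1.5, 3.0]}
print(task_arithmetic_merge(base, experts, scale=1.0))  # {'w': [2.0, 4.0]}
```

With `scale` set to one over the number of experts, Task Arithmetic reduces to Averaging when all experts were fine-tuned from the same base model, which is why the two methods are often compared directly.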
Peer Reviews
Decision: Submitted to ICLR 2025
- This paper systematically reveals the impact of different model sizes, quality, quantity, and merging methods on the effectiveness of model merging.
- The figures and tables in this paper are very clear.
- This paper is well organized and clearly written.
- Some inconsistencies lack explanation:
  - (1) In Figure 1, why is multi-tasking better than single-tasking in 8B and 24B, but multi-tasking is not better than single-tasking in 1B and 64B? How does this relate to model size?
  - (2) In Figure 5 (PaLM-2-24B, PaLM-2-64B), why is the generalization performance when the number of experts is 8 not as good as when the number of experts is 6? Why does the TIES method perform worse than the pre-trained model when the number of experts increases i
- This is a comprehensive evaluation that systematically examines multiple factors (model size, base model quality, number of experts, and merging methods) across a large-scale experimental setup, providing robust insights.
- It seems that when comparing merged pretrained "experts" and finetuned "experts", the pretrained one is never finetuned after the merging process. I think it might be unfair to compare a never-finetuned checkpoint with a finetuned checkpoint (although it is a merged checkpoint). Thus, it is very natural to predict that merging finetuned "experts" is better than merging pretrained "experts".
- All the tasks (held-in and held-out) are text based. It would be better if involving some v
1. The objective of this study is to offer profound insights regarding the scalability aspect of model merging, which indeed represents a significant direction within the realm of "scaling".
2. The research presented herein exhibits a comprehensive and meticulous experimental design, which encompasses multiple dimensions such as model sizes, merging methods, and the count of experts. The results are presented in a highly satisfactory manner. Through a sequence of well-conducted experiments, it h
1. The fact that the study's exclusive concentration lies on PaLM-based models does give rise to legitimate concerns regarding the generalizability of the findings to other architectural frameworks such as GPT, LLaMA, and Qwen.
2. Incomplete theoretical exploration: The paper is heavily empirical, lacking necessary theoretical analysis to explain the observed phenomena. For example, the relationship between weight disentanglement and merging effectiveness.
3. Constraints in Experimental Design: Th
Taxonomy
Topics: Model-Driven Software Engineering Techniques
Methods: Balanced Selection
