Realistic Evaluation of Model Merging for Compositional Generalization
Derek Tam, Yash Kant, Brian Lester, Igor Gilitschenski, and Colin Raffel

TL;DR
This paper evaluates model merging methods in a shared experimental setting spanning image classification, image generation, and natural language processing, clarifying each method's practical requirements, computational cost, and scalability.
Contribution
It introduces a shared experimental setup for comparing model merging methods and analyzes their relative merits in terms of performance, requirements, and scalability.
Findings
Different merging methods vary in performance depending on the task and setting.
Computational costs increase with the number of models merged.
The study clarifies the strengths and limitations of current merging techniques.
Abstract
Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform…
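To make "combining individual models into a single model" concrete, here is a minimal sketch of the simplest form of merging: uniform parameter averaging across models that share an architecture. This is an illustration only. The abstract does not say which methods the paper evaluates, and the function name `average_merge`, the `Params` type, and the toy parameter values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: each model is a mapping from parameter name
# to a flat list of floats. Real checkpoints would hold tensors instead.
Params = Dict[str, List[float]]


def average_merge(models: List[Params]) -> Params:
    """Uniformly average same-named parameters across models."""
    if not models:
        raise ValueError("need at least one model to merge")

    names = models[0].keys()
    for m in models[1:]:
        # Averaging only makes sense when the models share an architecture.
        if m.keys() != names:
            raise ValueError("all models must share the same parameter names")

    merged: Params = {}
    for name in names:
        tensors = [m[name] for m in models]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean over the models.
        merged[name] = [sum(vals) / len(models) for vals in zip(*tensors)]
    return merged


# Toy example with two "fine-tuned" models (values are made up).
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [1.0]}
merged = average_merge([model_a, model_b])
```

The sketch also shows one of the architectural assumptions the abstract alludes to: averaging requires every model to expose the same parameter names and shapes. The work is a single pass over all parameters of every input model.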
Peer Reviews
Decision·Submitted to ICLR 2025
* The related works and their summary of the methods are great
* I appreciated the detailed study of the experiments
* Overall writing is clear
* Title: I'm not super convinced that "compositional generalization" is the "realistic" goal for model merging. Many times, model merging might not be for the emerging capability of several different tasks, but just to improve on the same held-in tasks such as the original motivation of Model Soup etc.
* As for a conference-level study paper, I would expect it to reveal more surprising conclusions, insights, or theoretical motivations. Otherwise, it feels like this paper leans towards a worksh
1. The focus on evaluating compositional generalization is an important contribution, as intuitively it seems like this would be the main real-world use-case for model merging. It is surprising that prior work on model merging does not focus on compositional task generalization (if I understood the authors correctly). The task constructions are also nicely controlled and systematic, both for vision and language tasks.
2. The baselines comparison methods (such as training on some supervised data
1. The paper would be stronger if it benchmarked different base models on each task in order to make sure the core results generalize (especially different architectures, such as Transformers and CNNs on the vision tasks or Transformers and SSMs on language tasks).
2. It feels like a baseline is missing. To evaluate generalization error on held-out tasks, one could also evaluate a single fine-tuned model (one of the models being merged) on the held out task. For instance, maybe a model fine-tune
- Timely problem to study: model merging is both a scientifically interesting and currently useful practice, and a thorough empirical evaluation is very helpful. In its best form, I see this paper as being positioned to make contributions by answering two main questions, both of which I believe could really be novel and substantive contributions. 1) What is the relative ordering of merging methods? If a practitioner is interested in merging X architecture on Y task with Z budget, which method s
My main concerns revolve around the experimental design in Section 4.1. The best version of this work would help a user understand when to use model merging and which method to use, given some features about their task (e.g. architecture, modality, finetuning strategy, compute budget). This also seems to be the authors' ambition (lines 43-48). However, the experimental design makes it difficult to draw conclusions about these questions, and I'm concerned that some of the authors' conclusions hav
