Model Merging by Uncertainty-Based Gradient Matching
Nico Daheim, Thomas Möllenhoff, Edoardo Maria Ponti, Iryna Gurevych, Mohammad Emtiyaz Khan

TL;DR
This paper connects the inaccuracy of weighted-averaging model merging to mismatches in the gradients and introduces an uncertainty-based scheme that reduces the mismatch, improving both performance and robustness to hyperparameters for large language models and vision transformers.
Contribution
It proposes a novel uncertainty-based scheme that reduces gradient mismatch during model merging, enhancing performance and robustness over existing methods.
Findings
Improves model merging performance for language and vision models
Reduces sensitivity to hyperparameters in merging procedures
Provides theoretical insights into existing merging schemes
Abstract
Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here.
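The three existing schemes named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's uncertainty-based method: it assumes each model is a flattened parameter vector, that Fisher information is approximated by a per-parameter (diagonal) estimate, and that the function names, the scaling factors `alphas`, and the toy values are all hypothetical choices made for this example.

```python
import numpy as np


def weighted_average(thetas, alphas):
    """Plain weighted averaging: sum_t alpha_t * theta_t."""
    return sum(a * th for a, th in zip(alphas, thetas))


def task_arithmetic(theta_base, thetas, alphas):
    """Add scaled task vectors (theta_t - theta_base) to the base model."""
    return theta_base + sum(a * (th - theta_base) for a, th in zip(alphas, thetas))


def fisher_weighted_average(thetas, fishers, alphas, eps=1e-8):
    """Per-parameter average, weighted by diagonal Fisher estimates.

    `fishers` holds one non-negative vector per model (an assumed
    diagonal approximation); `eps` guards against division by zero.
    """
    num = sum(a * f * th for a, f, th in zip(alphas, fishers, thetas))
    den = sum(a * f for a, f in zip(alphas, fishers))
    return num / (den + eps)


# Toy two-parameter example (illustrative values only).
theta_base = np.array([0.0, 0.0])
theta_1 = np.array([1.0, 0.0])   # fine-tuned on task 1
theta_2 = np.array([0.0, 2.0])   # fine-tuned on task 2
fisher_1 = np.array([3.0, 1.0])  # assumed diagonal Fisher for model 1
fisher_2 = np.array([1.0, 3.0])  # assumed diagonal Fisher for model 2

print(weighted_average([theta_1, theta_2], [0.5, 0.5]))
print(task_arithmetic(theta_base, [theta_1, theta_2], [1.0, 1.0]))
print(fisher_weighted_average([theta_1, theta_2], [fisher_1, fisher_2], [1.0, 1.0]))
```

In this toy setting the Fisher-weighted result leans toward whichever model has the larger Fisher value for a given parameter, whereas plain averaging treats every parameter of every model the same.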
Peer Reviews
Decision: ICLR 2024 poster
From my viewpoint, the weights of an LLM are knowledge abstracted from data, which stresses the importance of quickly merging knowledge learned from different datasets. I believe the topic of the paper fits this conference and offers some inspiration for future work in this domain. The motivation of the paper, which comes from analyzing the gradient of merged models, is extremely clear. When I read the paper, I enjoyed the motivation despite the heavy math.
I have some minor concerns about this paper. Before I lay out the weaknesses list, I would like to mention that I’m not an expert on NLP and my comments are probably incorrect. Finetuning vs data-driven model averaging. Maybe I don’t have the background. I’m curious about the advantage of the proposed model merging over simply fine-tuning the model. In my understanding, for the proposed method to work, we would need data to calculate the gradient matrix -- that’s why I call the proposed method
+ The authors connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve performance by reducing the mismatch.
+ The authors propose a unified explanation of previous model merging techniques.
+ The new method shows consistent improvements for large language models and vision transformers in terms of performance and robustness to hyperparameters.
+ My major concern lies in the problem setup. I admit that model merging is a well-defined problem with much previous literature, as is discussed in the submission. But I still wonder why we need this technology. If we could obtain the data for each task, why don't we simply perform multi-task learning on these data? If we couldn't, how could we obtain the Fisher information matrix on each task, which is required to approach Eq. 12? It seems like a contradiction and I think more clarification on
- The introduction of the paper is very well written and frames the problem clearly.
- The idea of defining a "target model" is a very useful concept in this space.
- Section 3, overall, was difficult to follow with seemingly several notational errors and cluttered paragraphs, see questions and suggestions section.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
