Model Merging by Uncertainty-Based Gradient Matching

Nico Daheim; Thomas Möllenhoff; Edoardo Maria Ponti; Iryna Gurevych; Mohammad Emtiyaz Khan

arXiv:2310.12808 · cs.LG · August 26, 2024 · 2 cites

TL;DR

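This paper connects the inaccuracy of weighted-averaging in model merging to mismatches in the gradients, which explains why averaging works and when it fails. It then proposes an uncertainty-based scheme that reduces the mismatch, improving performance and robustness to hyperparameters for large language models and vision transformers.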

Contribution

It proposes a novel uncertainty-based scheme that reduces gradient mismatch during model merging, enhancing performance and robustness over existing methods.

Findings

01. Improves model merging performance for language and vision models
02. Reduces sensitivity to hyperparameters in merging procedures
03. Provides theoretical insights into existing merging schemes

Abstract

Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here.
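The three existing schemes named in the abstract can be illustrated on toy parameter vectors. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the function names, the toy numbers, and the use of diagonal Fisher estimates are all hypothetical choices made here, and the paper's own uncertainty-based scheme is not reproduced because this page does not give its formula.

```python
# Toy illustration of three merging schemes named in the abstract.
# All names and numbers are illustrative assumptions, not taken from the paper.

def weighted_average(thetas, alphas):
    """Merge as sum_t alpha_t * theta_t, coordinate by coordinate."""
    dim = len(thetas[0])
    return [sum(a * th[i] for a, th in zip(alphas, thetas)) for i in range(dim)]

def task_arithmetic(theta_pre, thetas, alphas):
    """Add scaled task vectors (theta_t - theta_pre) to the pretrained parameters."""
    return [
        p + sum(a * (th[i] - p) for a, th in zip(alphas, thetas))
        for i, p in enumerate(theta_pre)
    ]

def fisher_weighted_average(thetas, fishers, alphas, eps=1e-8):
    """Average each coordinate with weights alpha_t * F_t (diagonal Fisher assumed)."""
    merged = []
    for i in range(len(thetas[0])):
        num = sum(a * f[i] * th[i] for a, f, th in zip(alphas, fishers, thetas))
        den = sum(a * f[i] for a, f in zip(alphas, fishers))
        merged.append(num / (den + eps))  # eps guards against a zero denominator
    return merged

# Hypothetical pretrained and fine-tuned parameters for two tasks.
theta_pre = [0.0, 0.0]
theta_1, theta_2 = [1.0, 0.0], [0.0, 2.0]
# Hypothetical diagonal Fisher estimates: task 1 is more certain about coordinate 0.
fisher_1, fisher_2 = [3.0, 1.0], [1.0, 1.0]

avg = weighted_average([theta_1, theta_2], [0.5, 0.5])
ta = task_arithmetic(theta_pre, [theta_1, theta_2], [1.0, 1.0])
fa = fisher_weighted_average([theta_1, theta_2], [fisher_1, fisher_2], [1.0, 1.0])
print(avg, ta, fa)
```

In this toy setup, Fisher weighting pulls the first coordinate toward task 1's value because task 1 has the larger Fisher entry there, whereas plain averaging treats both models equally in every coordinate.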

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

From my viewpoint, the weights of an LLM are knowledge abstracted from data, which stresses the importance of quickly merging knowledge learned from the dataset. I believe the topic of the paper fits this conference and offers some inspiration for future work in this domain. The paper motivates its method very clearly by analyzing the gradient of merged models. When I read the paper, I enjoyed the motivation despite the heavy math.

Weaknesses

I have some minor concerns about this paper. Before I lay out the list of weaknesses, I would like to mention that I’m not an expert on NLP and my comments may be incorrect. Finetuning vs. data-driven model averaging: Maybe I don’t have the background, but I’m curious about the advantage of the proposed model merging over simply fine-tuning the model. In my understanding, for the proposed method to work, we would need data to calculate the gradient matrix -- that’s why I call the proposed method

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

+ The authors connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve performance by reducing the mismatch.
+ The authors propose a unified explanation of previous model merging techniques.
+ The new method shows consistent improvements for large language models and vision transformers in terms of performance and robustness to hyperparameters.

Weaknesses

+ My major concern lies in the problem setup. I admit that model merging is a well-defined problem with much previous literature, as is discussed in the submission. But I still wonder why we need this technology. If we could obtain the data for each task, why don't we simply perform multi-task learning on these data? If we couldn't, how could we obtain the fisher information matrix on each task, which is required to approach Eq.12? It seems like a contradiction and I think more clarification on

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

- The introduction of the paper is very well written and frames the problem clearly.
- The idea of defining a "target model" is a very useful concept in this space.

Weaknesses

- Section 3, overall, was difficult to follow with seemingly several notational errors and cluttered paragraphs, see questions and suggestions section.

Code & Models

Repositories

ukplab/iclr2024-model-merging · Official

Videos

Model Merging by Uncertainty-Based Gradient Matching · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques