AdaMerging: Adaptive Model Merging for Multi-Task Learning
Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, Dacheng Tao

TL;DR
AdaMerging introduces an unsupervised, adaptive approach to effectively merge multiple pre-trained models for multi-task learning, improving performance and robustness without requiring original training data.
Contribution
This paper presents AdaMerging, a novel unsupervised method that learns optimal model merging coefficients automatically, enhancing multi-task model integration.
Findings
Achieves 11% performance improvement over state-of-the-art task arithmetic merging.
Demonstrates superior generalization to unseen downstream tasks.
Shows increased robustness to data distribution shifts during testing.
Abstract
Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise…
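As an illustration of the merging step the abstract describes, here is a minimal sketch in plain Python. It assumes each model is a flat list of floats and that a task vector is the difference between a fine-tuned model and the pre-trained one. The function names (`merge_task_wise`, `softmax`, `entropy`) and the toy numbers are hypothetical and not taken from the paper's code. The entropy helper reflects the reviewers' description of test-time entropy as the objective for learning the coefficients; the optimization loop itself is omitted.

```python
import math

def merge_task_wise(pretrained, finetuned_models, coeffs):
    """Return theta_pre + sum_k coeffs[k] * (theta_k - theta_pre).

    Hypothetical helper: models are flat lists of floats of equal length.
    """
    merged = list(pretrained)
    for theta_k, lam in zip(finetuned_models, coeffs):
        for i, (w_k, w_0) in enumerate(zip(theta_k, pretrained)):
            merged[i] += lam * (w_k - w_0)  # scaled task vector entry
    return merged

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Toy example (made-up numbers): one pre-trained model, two fine-tuned models.
pretrained = [0.0, 1.0, -1.0]
finetuned = [[1.0, 1.0, -1.0], [0.0, 3.0, -2.0]]

# Plain task arithmetic: every task vector added with coefficient 1.
merged_plain = merge_task_wise(pretrained, finetuned, [1.0, 1.0])

# Adaptive variant: one coefficient per task. In this sketch they are fixed
# by hand; AdaMerging would learn them instead.
merged_scaled = merge_task_wise(pretrained, finetuned, [0.5, 0.25])

# Confident predictions have lower entropy than uniform ones, which is what
# makes entropy usable as an unsupervised signal on unlabeled test inputs.
confident = entropy(softmax([5.0, 0.0]))
uniform = entropy(softmax([0.0, 0.0]))
```

In a fuller version, the coefficients would be treated as trainable parameters and updated by gradient descent on the average prediction entropy over unlabeled test samples. A layer-wise variant would keep one coefficient per task and per layer rather than one per task.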
Peer Reviews
Decision: ICLR 2024 poster
1. This paper studies model merging without original data, which is an important research direction. 2. This paper proposes an unsupervised model merging scheme, which is technically feasible. Experimental results show that the proposed scheme has better multi-task performance, generalization, and robustness. 3. The paper is well organized and easy to understand, and the proposed solutions are easy to follow and implement.
1. In the motivation, the authors need to explain the intuition behind using entropy minimization as a proxy objective for the loss. 2. In the experimental analysis, the authors need to explain why AdaMerging has better generalization and robustness.
- **Good writing**: The paper is well written and easy to follow, also with good illustrative figures. - **Interesting research direction**: Task vector arithmetic for foundation model is a novel and interesting take on multi-task learning. Extending it to learning the task vector weights seems like a natural and meaningful direction, and very much in-line with automatic loss/gradient weighing scheme in standard multi-task optimization methods. - **Good set of ablation experiments**: The resu
- The conclusion of **Section 3.2.2** seems a bit strong to me given the conducted experiment: the analysis shows that the entropy and loss of a trained MTL model are nicely correlated, but it does not necessarily mean that they yield equally good directions during training. Doing the same analysis at different timesteps during the MTL model training could show whether and how this correlation holds during training. - **Discrepancy in supervision**: If I understood correctly, the single-task an
* Originality: Merging multiple fine-tuned models has been shown feasible, but this paper proposed and proved that using testing-time entropy as an objective to learn merging weights is effective and can be automatic. They also suggested that learning weights across different layers is crucial to the success of merging. These show the strength in originality. * Quality: The results of the experiments are solid and promising. They closed the performance gap between conventional MTL and task arith
* Though we don’t need to retrain the model on the original training data, we still need access to a certain amount of testing data for testing-time entropy minimization. The authors are encouraged to study and present how the (minimum) amount of testing data affects the quality of the merging weights ($\lambda$ in the paper). * In addition to the amount of testing data, the burden/computational needs/computational time to learn $\lambda$ to converge via unlabeled testing samples is missing and la
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Multimodal Machine Learning Applications
