SuperMerge: An Approach For Gradient-Based Model Merging
Haoyu Yang, Zheng Zhang, Saket Sathe

TL;DR
SuperMerge introduces a gradient-based model merging technique that efficiently combines multiple fine-tuned models for different tasks, enabling quick updates and maintaining high performance without extensive retraining.
Contribution
The paper presents SUPERMERGE, a novel lightweight, fast, gradient-based model merging approach with a hierarchical strategy to handle multiple tasks efficiently.
Findings
SUPERMERGE achieves comparable performance to fully fine-tuned models.
It outperforms existing model merging methods on NLP and computer vision tasks.
The hierarchical merging reduces space requirements without performance loss.
Abstract
Large language models, such as ChatGPT, Claude, or LLaMA, are gigantic, monolithic, and possess the superpower to simultaneously support thousands of tasks. However, high-throughput applications often prefer smaller task-specific models because of their lower latency and cost. One challenge of using task-specific models is the incremental need for solving newer tasks after the model is already deployed for existing tasks. A straightforward solution requires fine-tuning the model again for both existing and new tasks, which is computationally expensive and time-consuming. To address this issue, we propose a model merging based approach called SUPERMERGE. SUPERMERGE is a gradient-based method to systematically merge several fine-tuned models trained on existing and new tasks. SUPERMERGE is designed to be lightweight and fast, and the merged model achieves similar performance to fully…
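The abstract does not spell out the update rule, so the following is only a minimal sketch of what gradient-based merging can look like, not the authors' implementation. It assumes each fine-tuned model is represented by a task vector (fine-tuned weights minus base weights), that one learnable scalar per task is squashed through tanh (a reviewer excerpt on this page mentions tanh, but the exact parameterization here is a guess), and that the scalars are fit by plain gradient descent on a small labeled set. The toy linear model and all names (`merge`, `learn_merge_weights`, `task_vectors`) are hypothetical.

```python
import math
import random


def merge(theta0, task_vectors, alpha):
    """Return theta0 + sum_i tanh(alpha_i) * tau_i (assumed merge rule)."""
    merged = list(theta0)
    for a, tau in zip(alpha, task_vectors):
        w = math.tanh(a)  # merge weight kept inside (-1, 1)
        for j, t in enumerate(tau):
            merged[j] += w * t
    return merged


def predict(theta, row):
    return sum(t * x for t, x in zip(theta, row))


def mse(theta, X, y):
    return sum((predict(theta, row) - target) ** 2
               for row, target in zip(X, y)) / len(y)


def learn_merge_weights(theta0, task_vectors, X, y, steps=1500, lr=0.02):
    """Fit only the per-task merge scalars; the model weights stay frozen."""
    alpha = [0.0] * len(task_vectors)
    for _ in range(steps):
        theta = merge(theta0, task_vectors, alpha)
        # Gradient of the MSE loss with respect to the merged weights.
        grad_theta = [0.0] * len(theta)
        for row, target in zip(X, y):
            r = predict(theta, row) - target
            for j, x in enumerate(row):
                grad_theta[j] += 2.0 * r * x / len(y)
        # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2.
        for i, tau in enumerate(task_vectors):
            dot = sum(g * t for g, t in zip(grad_theta, tau))
            alpha[i] -= lr * dot * (1.0 - math.tanh(alpha[i]) ** 2)
    return alpha


# Toy data: the "ideal" weights are a blend of two task vectors.
random.seed(0)
dim = 8
theta0 = [random.gauss(0, 1) for _ in range(dim)]
task_vectors = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(2)]
target_theta = [b + 0.6 * t1 + 0.3 * t2
                for b, t1, t2 in zip(theta0, *task_vectors)]
X = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(32)]
y = [predict(target_theta, row) for row in X]

alpha = learn_merge_weights(theta0, task_vectors, X, y)
loss_before = mse(theta0, X, y)
loss_after = mse(merge(theta0, task_vectors, alpha), X, y)
```

Only two scalars are trained in this toy, which illustrates why such a scheme can be far cheaper than fine-tuning the full weight vector again; whether SUPERMERGE uses per-task, per-layer, or finer-grained merge weights is not stated in the text above.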
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. The performance of SuperMerge is quite strong compared to the cited works, across all settings. 2. SuperMerge is an efficient learning framework that achieves strong performance from few labeled examples. 3. The authors motivate their approach decently, though experimental results are the more compelling motivation. 4. The authors ablate SuperMerge well, showing that tanh delivers strong performance results and enables more flexible merges compared to existing work. 5. With the exception of
**Weaknesses** 1. **Insufficient baseline comparison:** Given that the authors propose a gradient-based model merging approach, I believe MaTS (https://arxiv.org/pdf/2312.04339) should be compared against. If I understand the settings between SuperMerge and MaTS correctly, MaTS appears to perform *dramatically better* than SuperMerge in the IA3 setting. I would imagine that SuperMerge is significantly more computationally efficient than MaTS, though it is difficult to tell given that the reporte
1. SuperMerge is a gradient-based model merging method that can generalize to out-of-domain datasets. 2. The number of merged models is larger than in previous works. 3. The authors also propose a hierarchical merging method that can reduce the memory footprint, with a slight performance decrease.
I recommend that the authors reorganize their paper, as the current version is difficult to follow. The current presentation of the paper does not meet the standard of ICLR in general. The major concern is that the experimental results cannot support the claims. In addition, I'm afraid that I have to disagree with some general statements that the authors made in the Introduction, see Questions point 1 in the following section. 1. **Reorganization of Presentation:** 1. In Section 4, it would
**Efficient Model Merging**: SUPERMERGE enables fast, gradient-based merging of task-specific models, avoiding repetitive and costly fine-tuning. **Enhanced Performance**: SUPERMERGE demonstrates superior accuracy over other model-merging methods across NLP and computer vision tasks. **Robust Generalization**: It performs well on out-of-domain data, showing adaptability to unseen tasks. **Minimal Computational Overhead**: The method requires significantly fewer parameters and computational re
I contend that there are the following weaknesses: 1. The paper should focus more on discussing the practical applications of this model merging method that requires no additional training and describe the specific advantages of this approach compared to multi-task learning. It seems that model merging may consume a large amount of GPU memory, especially for large language models (LLMs). If one aims to merge multiple LLMs, the memory usage could be extremely high. 2. Currently, LLMs typically
Taxonomy
Topics: Model-Driven Software Engineering Techniques
Methods: LLaMA
