ZipIt! Merging Models from Different Tasks without Training
George Stoica, Daniel Bolya, Jakob Bjorner, Pratik Ramesh, Taylor Hearn, Judy Hoffman

TL;DR
ZipIt! merges deep visual recognition models trained on different tasks into one model without additional training. It extends feature merging to work within each model as well as across models, and supports partial zipping to create multi-task models.
Contribution
The paper presents ZipIt!, a general approach that merges models from different tasks without retraining, using feature expansion and partial zipping strategies.
Findings
Achieves 20-60% improvement over prior model merging methods.
Enables creation of multi-task models without retraining.
Effectively merges models trained on disjoint tasks.
Abstract
Typical deep visual recognition models are capable of performing the one task they were trained on. In this paper, we tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other then averages them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation.…
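The "zip" idea described above can be illustrated with a minimal sketch. The abstract does not say how features are paired, so this sketch assumes greedy matching by correlation over layer activations; the names `zip_match`, `merge`, and `unmerge` are hypothetical and are not the authors' API. Unlike permutation-based merging, a pair here may come from the same model as well as from different models.

```python
import numpy as np

def zip_match(feats_a, feats_b):
    """Greedily pair the most correlated features of two models.

    ASSUMPTION: correlation-based greedy matching is an illustrative
    choice, not something stated in the abstract.

    feats_a, feats_b: activations of shape (n_samples, d) from the same
    layer of model A and model B.
    Returns (merge, unmerge) with shapes (d, 2d) and (2d, d).
    """
    d = feats_a.shape[1]
    # Treat both models' features as one pool of 2d candidates, so a
    # pair can be A-A, B-B, or A-B.
    feats = np.concatenate([feats_a, feats_b], axis=1)
    corr = np.corrcoef(feats, rowvar=False)  # (2d, 2d)
    np.fill_diagonal(corr, -np.inf)          # no self-pairing

    merge = np.zeros((d, 2 * d))
    unmerge = np.zeros((2 * d, d))
    for k in range(d):
        # Pick the most correlated pair that is still unmatched.
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        merge[k, [i, j]] = 0.5      # merged feature k averages i and j
        unmerge[[i, j], k] = 1.0    # both originals read from feature k
        # Remove the matched features from further consideration.
        corr[[i, j], :] = -np.inf
        corr[:, [i, j]] = -np.inf
    return merge, unmerge

# Toy usage: model B's features are a noisy, reversed copy of model A's.
rng = np.random.default_rng(0)
a = rng.normal(size=(512, 4))
b = a[:, ::-1] + 0.01 * rng.normal(size=(512, 4))
merge, unmerge = zip_match(a, b)
zipped = np.concatenate([a, b], axis=1) @ merge.T  # (512, 4) merged features
```

In this sketch `merge` would be applied to a layer's outputs and `unmerge` to the next layer's inputs, so the merged network keeps the original width. Restricting the pairs to A-B only would reduce it to the permute-then-average scheme of prior work.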
Peer Reviews
Decision: ICLR 2024 poster
1. This paper studies a new and interesting problem: merging differently initialized models trained on different tasks into a single model without retraining. This relaxes the condition of the model merging problem, recently studied in works such as Model Soups and Git Re-Basin, which assumes the same task over models. This makes the problem more challenging while allowing for broader use cases. 2. The idea of merging weights both within and across models is simple, but achieves reasonably good
1. My main concern is the stability of the method. I am not sure whether the method works robustly across different settings, because empirical observations reported in the GitHub issues indicate that a small change (e.g. changing the learning rate and epoch numbers when training the models to merge) can cause the merged model to crash to low accuracy. 2. These are not weaknesses, but there are some limitations. The performance drop increases when merging multiple models. Accuracy drop looks large when mer
This paper proposes a novel method named “ZipIt!” which can maintain performance after merging two models trained on two disjoint tasks, which prior work fails to do. Partial zipping additionally offers a trade-off between performance and FLOPs. Finally, a theoretical proof is given to ensure the existence of a transformation matrix.
The method does not seem practical for real-world usage, and the paper lacks insights and theoretical analysis of the model properties. The method works fine on the CIFAR dataset, but doesn’t have remarkable advantages on large datasets such as ImageNet and on disjoint tasks. Moreover, the paper lacks comparisons between its method and a model trained on both datasets directly, which limits its usage scenarios.
1. This paper addresses the interesting research topic of merging models trained on different datasets without additional training. 2. Large-scale experiments such as ImageNet-1k confirm the effectiveness of the proposed method. This paper conducts experiments in multi-dataset merging, classification and segmentation merging. In particular, this is the first effort, to my knowledge, to merge the different modalities of classification and segmentation. 3. For reproducibility, the authors mentioned that th
1. The paper extends feature matching between models to within the model in order to merge models across different datasets. Its contribution should be examined in more detail. In particular, there is no theoretical support as to why matching the features would lower the loss after merging. I did not understand the connection between Theorem 1 and the fact that joint Acc can be sufficiently large for merging models on different datasets. Is the extension to "within merge" not only more flexible
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
