Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
Edan Kinderman, Itay Hubara, Haggai Maron, Daniel Soudry

TL;DR
This paper introduces FS-Merge, a data-efficient method that merges large transformers trained on different tasks from different initializations into a single model by training a SuperNet and then folding it back to the size of one original model. It outperforms traditional merging methods and knowledge distillation, especially in low-data scenarios.
Contribution
We propose FS-Merge, a novel SuperNet-based merging technique that is simple, data-efficient, and more expressive than traditional methods for combining diverse large transformers.
Findings
FS-Merge outperforms traditional merging methods and knowledge distillation (KD).
It achieves state-of-the-art results across various models and tasks.
FS-Merge is especially effective in low-data scenarios.
Abstract
Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained network, we target the harder problem of merging large transformers trained on different tasks from distinct initializations. We show that traditional merging methods fail catastrophically in this setup, while Knowledge Distillation (KD) achieves much better results, though at a higher cost. However, KD is data-inefficient, as it does not exploit the original models' weights. To solve this, we introduce "Foldable SuperNet Merge" (FS-Merge), which trains a SuperNet containing the original models (with frozen weights) using a feature reconstruction objective. After training, the SuperNet is folded back to the size of a single original model. FS-Merge is…
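The folding step mentioned in the abstract can be illustrated for a single linear layer. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the SuperNet layer is the block-diagonal stack of the two frozen weight matrices, wrapped by a learned "unmerge" matrix on the input side and a learned "merge" matrix on the output side (the merging/unmerging layers the reviews refer to). The names `block_diag2`, `fold_linear`, `merge`, and `unmerge` are hypothetical, and the feature reconstruction training that would learn these matrices is omitted.

```python
import numpy as np

def block_diag2(w_a, w_b):
    """Stack two weight matrices on the diagonal of a larger zero matrix."""
    out_a, in_a = w_a.shape
    out_b, in_b = w_b.shape
    big = np.zeros((out_a + out_b, in_a + in_b), dtype=w_a.dtype)
    big[:out_a, :in_a] = w_a
    big[out_a:, in_a:] = w_b
    return big

def fold_linear(w_a, w_b, merge, unmerge):
    """Fold a (hypothetical) SuperNet linear layer into one weight matrix.

    merge:   (d_out, 2 * d_out) matrix applied after the frozen weights.
    unmerge: (2 * d_in, d_in) matrix applied before the frozen weights.
    The result has the same shape as a single original layer.
    """
    return merge @ block_diag2(w_a, w_b) @ unmerge

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
w_a = rng.normal(size=(d_out, d_in))  # stand-in for model A's frozen weights
w_b = rng.normal(size=(d_out, d_in))  # stand-in for model B's frozen weights

# Special case 1: duplicate the input and average the outputs.
merge_avg = 0.5 * np.hstack([np.eye(d_out), np.eye(d_out)])
unmerge_dup = np.vstack([np.eye(d_in), np.eye(d_in)])
w_avg = fold_linear(w_a, w_b, merge_avg, unmerge_dup)

# Special case 2: route everything through model A only.
merge_a = np.hstack([np.eye(d_out), np.zeros((d_out, d_out))])
unmerge_a = np.vstack([np.eye(d_in), np.zeros((d_in, d_in))])
w_only_a = fold_linear(w_a, w_b, merge_a, unmerge_a)
```

In this toy setup, plain weight averaging and selecting one of the two models are both particular choices of `merge` and `unmerge`, which gives some intuition for why a learned SuperNet can be at least as expressive as those baselines for a single layer.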
Peer Reviews
Decision: Submitted to ICLR 2025
- The method is data-efficient, requiring only an unlabeled subset of the training data for optimization, which is advantageous when full access to data is limited.
- The paper presents comprehensive experimental results across various model architectures and data scenarios, demonstrating the scalability and effectiveness of FS-Merge in merging models of different scales and tasks.
- While the method is designed for models trained from scratch, it would be insightful to investigate its performance when applied to pretrained models that are fine-tuned on different sources. Specifically, it would be beneficial to explore potential challenges or advantages this application might present compared to merging models trained from scratch. This analysis could provide a broader understanding of the method's applicability and limitations.
- The paper could benefit from a discussion
1. The paper tackles an intriguing problem: merging large transformers trained on different tasks from distinct initializations into a single model.
2. The proposed method is simple and easy to follow.
1. FS-Merge is a training-based merging method, which can be costly compared to other model merging techniques.
2. The figures, such as Figure 3, are low resolution, and the overall writing quality of the paper is not very professional, requiring significant improvement and refinement.
3. The datasets used, such as MNIST and SVHN, are relatively small, and the performance improvements appear marginal. Also, the experimental setup seems unique to this paper and not aligned with standard practices
1. The paper is well-structured and easy to follow.
2. The paper polishes the idea of merging by folding from [1], i.e., merging two weights by inserting merging/unmerging layers, and extends it to the specific architecture of Transformer. Also the paper shows it works well when combined with knowledge distillation on unlabeled data.
1. Although the extension of the previous idea to Transformer is a technical contribution, the novelty of the proposed method is still limited, since it just applies feature-level knowledge distillation to the merging/unmerging layers originally proposed in [1].
2. While the title suggests that this paper addresses the general problem of merging Transformers with different initializations, the experiments are performed only with the Vision Transformer on a few downstream tasks.
Taxonomy
Topics: Modular Robots and Swarm Intelligence · Advanced Materials and Mechanics · Advanced Memory and Neural Computing
