Merging Multi-Task Models via Weight-Ensembling Mixture of Experts
Anke Tang, Li Shen, Yong Luo, Nan Yin, Lefei Zhang, Dacheng Tao

TL;DR
This paper introduces a novel method for merging multi-task Transformer models using a weight-ensembling mixture of experts, which dynamically integrates shared and task-specific knowledge to improve performance and mitigate parameter interference.
Contribution
The paper proposes a dynamic MoE-based merging approach that separates shared and task-specific knowledge, enhancing multi-task model integration beyond static methods.
Findings
Effective multi-task model merging demonstrated
Improved generalization and robustness shown
Dynamic integration outperforms static methods
Abstract
Merging task-specific Transformer-based models, each trained on a different task, produces a single unified model that can execute all of the tasks concurrently. Previous methods, exemplified by task arithmetic, have proven both effective and scalable, but they have primarily sought a static optimal solution within the original model parameter space. A notable challenge is mitigating the interference between parameters of different models, which can substantially degrade performance. In this paper, we propose to merge most of the parameters while upscaling the MLP of the Transformer layers to a weight-ensembling mixture of experts (MoE) module, which can dynamically integrate shared and task-specific knowledge based on the input, thereby providing a more flexible solution that can adapt to the specific needs of each instance. Our key insight is that by identifying…
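The contrast the abstract draws between static merging and input-dependent merging can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that weights are flat lists of floats, that a "task vector" is the difference between fine-tuned and pre-trained weights, and that the router is a single linear map followed by a softmax. The names `task_arithmetic`, `linear_router`, and `weight_ensembling_mlp` are hypothetical and chosen only for illustration.

```python
import math


def task_vector(finetuned, pretrained):
    """Difference between fine-tuned and pre-trained weights (assumed definition)."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def task_arithmetic(pretrained, finetuned_models, scale):
    """Static merge: add every task vector with one fixed scaling coefficient."""
    merged = list(pretrained)
    for ft in finetuned_models:
        tv = task_vector(ft, pretrained)
        merged = [m + scale * t for m, t in zip(merged, tv)]
    return merged


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_router(x, router_weights):
    """Hypothetical router: one logit per task, computed from the input features."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]


def weight_ensembling_mlp(pretrained_mlp, finetuned_mlps, router_logits):
    """Dynamic merge: the gates depend on the input, so the MLP weights do too."""
    gates = softmax(router_logits)  # assumption: softmax-normalised gates
    merged = list(pretrained_mlp)
    for gate, ft in zip(gates, finetuned_mlps):
        tv = task_vector(ft, pretrained_mlp)
        merged = [m + gate * t for m, t in zip(merged, tv)]
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0]
    experts = [[1.0, 0.0], [0.0, 1.0]]
    # Static merge uses the same weights for every input.
    print(task_arithmetic(base, experts, scale=0.5))
    # Dynamic merge produces different MLP weights for different inputs.
    router_w = [[100.0, 0.0], [-100.0, 0.0]]
    for x in ([1.0, 0.0], [-1.0, 0.0]):
        print(weight_ensembling_mlp(base, experts, linear_router(x, router_w)))
```

In this toy setting the static merge returns one fixed set of weights, while the dynamic merge leans toward whichever task vector the router favours for the given input, which is the flexibility the abstract attributes to the MoE module.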
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Human Mobility and Location-Based Analysis · Recommender Systems and Techniques
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Residual Connection · Absolute Position Encodings · Dropout · Layer Normalization
