Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki

TL;DR
Drop-Upcycling improves the training efficiency and long-term performance of sparse Mixture of Experts (MoE) models by combining knowledge transfer from a pre-trained dense model with partial re-initialization of the weights, so that an MoE model can match the performance of a larger dense model with fewer FLOPs.
Contribution
This paper introduces Drop-Upcycling, a method that initializes an MoE model from a pre-trained dense model while re-initializing part of the weights, which promotes expert specialization and leads to better long-term performance and efficiency.
Findings
Outperforms previous MoE methods in large-scale experiments
Achieves comparable performance to larger dense models with fewer FLOPs
Demonstrates effectiveness when training on hundreds of billions of tokens
Abstract
The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses more slowly than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms…
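The idea of copying a dense model into several experts while re-initializing part of each copy can be illustrated with a minimal NumPy sketch. The abstract does not specify which weights are re-initialized or how, so everything below is an assumption made for illustration: the sketch treats the feed-forward layer as two matrices (`w_up`, `w_down`), re-initializes a random fraction `ratio` of hidden units per expert, and draws the new values from a normal distribution matching the mean and standard deviation of the dense matrix. The function names and the `ratio` parameter are hypothetical.

```python
import numpy as np


def drop_upcycle_expert(w_up, w_down, ratio, rng):
    """Copy one dense FFN into an expert, re-initializing a fraction of hidden units.

    w_up:   (d_model, d_ff) dense up-projection
    w_down: (d_ff, d_model) dense down-projection
    ratio:  fraction of the d_ff hidden units to re-initialize (hypothetical knob)
    """
    d_ff = w_up.shape[1]
    n_drop = int(round(ratio * d_ff))
    # Each expert gets its own random subset of hidden units to re-initialize.
    idx = rng.choice(d_ff, size=n_drop, replace=False)

    new_up, new_down = w_up.copy(), w_down.copy()
    # Assumption: fresh values follow a normal distribution whose mean and
    # standard deviation match the dense matrix they partially replace.
    new_up[:, idx] = rng.normal(w_up.mean(), w_up.std(),
                                size=(w_up.shape[0], n_drop))
    new_down[idx, :] = rng.normal(w_down.mean(), w_down.std(),
                                  size=(n_drop, w_down.shape[1]))
    return new_up, new_down, idx


def drop_upcycle(w_up, w_down, num_experts, ratio, seed=0):
    """Build `num_experts` experts from one dense FFN."""
    rng = np.random.default_rng(seed)
    return [drop_upcycle_expert(w_up, w_down, ratio, rng)
            for _ in range(num_experts)]


# Toy dense FFN weights standing in for a pre-trained model.
rng = np.random.default_rng(42)
dense_up = rng.normal(0.0, 0.02, size=(8, 32))
dense_down = rng.normal(0.0, 0.02, size=(32, 8))
experts = drop_upcycle(dense_up, dense_down, num_experts=4, ratio=0.5)
```

In this sketch, `ratio=0.0` leaves every expert an exact copy of the dense layer (plain upcycling), while `ratio=1.0` discards all of the dense weights. Intermediate values keep part of the pre-trained knowledge while making the experts differ from one another, which is the trade-off the abstract describes.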
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
Methods: Mixture of Experts
