Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Taishi Nakamura; Takuya Akiba; Kazuki Fujii; Yusuke Oda; Rio Yokota; Jun Suzuki

arXiv:2502.19261·cs.CL·March 18, 2025

TL;DR

Drop-Upcycling improves training efficiency and long-term performance of sparse Mixture of Experts models by combining knowledge transfer from pre-trained dense models with strategic re-initialization, enabling large-scale models to match dense model performance with fewer resources.

Contribution

This paper introduces Drop-Upcycling, a method for training MoE models that keeps the knowledge of a pre-trained dense model while re-initializing part of the weights, leading to better long-term performance and training efficiency.

Findings

01

Outperforms previous MoE methods on large-scale tasks

02

Achieves comparable performance to larger dense models with fewer FLOPs

03

Demonstrates effectiveness when training on hundreds of billions of tokens

Abstract

The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms…
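The idea described in the abstract (copy the dense weights into every expert, then re-initialize part of each copy) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the text above: the function name `drop_upcycle`, the `ratio` parameter, the choice to re-sample whole hidden-unit columns, and the choice to draw new values from a normal distribution matching the mean and standard deviation of the dense weights. Plain Python lists are used to keep the example dependency-free.

```python
import random
import statistics


def drop_upcycle(w_up, num_experts, ratio, seed=0):
    """Build expert weight matrices from one dense FFN matrix.

    w_up: list of rows with shape (d_model, d_ff), the dense weights.
    ratio: assumed hyperparameter, fraction of hidden units (columns)
           to re-initialize in each expert.
    """
    rng = random.Random(seed)
    d_ff = len(w_up[0])

    # Statistics of the dense weights, used to draw replacement values
    # (assumption: re-initialize from a matching normal distribution).
    flat = [x for row in w_up for x in row]
    mean, std = statistics.fmean(flat), statistics.pstdev(flat)

    n_reinit = int(ratio * d_ff)
    experts = []
    for _ in range(num_experts):
        # Start from a copy of the dense weights (plain upcycling).
        expert = [row[:] for row in w_up]
        # Pick a different random subset of columns for each expert.
        cols = rng.sample(range(d_ff), n_reinit)
        for row in expert:
            for c in cols:
                row[c] = rng.gauss(mean, std)
        experts.append(expert)
    return experts


# Toy dense matrix with d_model=4 and d_ff=8.
dense = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
experts = drop_upcycle(dense, num_experts=4, ratio=0.5)
```

With `ratio=0.0` this reduces to plain upcycling, where every expert is an identical copy of the dense weights; larger values make the experts differ from one another at initialization, which is the property the abstract associates with expert specialization.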

Videos

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization · slideslive

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)

Methods: Mixture of Experts