MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models
Xin Ye, Daning Cheng, Boyang Zhang, Yunquan Zhang

TL;DR
MoE-DisCo is a staged training approach for Mixture-of-Experts models that cuts training costs by up to 69.5%: the model is decomposed into dense submodels that are trained on low-cost hardware, and the experts are then integrated and fine-tuned on high-end GPUs.
Contribution
The paper presents a novel staged training framework for MoE models that enables cost-effective training on affordable hardware without sacrificing performance.
Findings
Achieves comparable or better performance than full training.
Reduces training costs by up to 69.5%.
Effective across multiple downstream tasks.
Abstract
Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. In contrast, affordable hardware is low-cost but constrained by memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. Subsequently, all experts are integrated into a complete MoE model and fine-tuned…
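The staged pipeline described in the abstract (cluster the data, train one expert per cluster independently, then assemble the experts into one model) can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the names `kmeans`, `train_expert`, `ToyMoE`, and `train_staged` are hypothetical, k-means stands in for the unspecified clustering method, each "expert" is a single scalar fitted by gradient descent, routing is nearest-centroid top-1, and the shared backbone and the final fine-tuning stage are omitted entirely.

```python
import random


def nearest(x, centroids):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; a stand-in for the paper's unsupervised clustering."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def train_expert(targets, lr=0.1, steps=200):
    """Toy expert: one scalar fitted to its own subset by gradient descent on MSE.
    No other expert or device is consulted, mirroring the independent training."""
    if not targets:
        return 0.0
    bias = 0.0
    for _ in range(steps):
        grad = sum(2 * (bias - y) for y in targets) / len(targets)
        bias -= lr * grad
    return bias


class ToyMoE:
    """Assembled model: nearest-centroid top-1 routing (an assumption)."""

    def __init__(self, centroids, experts):
        self.centroids = centroids
        self.experts = experts

    def predict(self, x):
        return self.experts[nearest(x, self.centroids)]


def train_staged(points, targets, k):
    # Stage 1: partition the data into k subsets.
    centroids, assign = kmeans(points, k)
    # Stage 2: train one expert per subset, each independently of the others.
    experts = [train_expert([y for y, a in zip(targets, assign) if a == c])
               for c in range(k)]
    # Stage 3: integrate the experts (fine-tuning of the full model is omitted).
    return ToyMoE(centroids, experts)


# Two well-separated groups of inputs with different targets.
points = [(0.1 * i, 0.0) for i in range(5)] + [(10.0 + 0.1 * i, 10.0) for i in range(5)]
targets = [1.0] * 5 + [-1.0] * 5
moe = train_staged(points, targets, k=2)
```

Because each call to `train_expert` sees only its own subset, the loop in stage 2 could run on separate devices with no communication between them, which is the property the abstract relies on to use low-cost hardware.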
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
