MoE-DisCo: Low Economy Cost Training Mixture-of-Experts Models

Xin Ye; Daning Cheng; Boyang Zhang; Yunquan Zhang

arXiv:2601.06857 · cs.LG · January 13, 2026

Open Access

TL;DR

MoE-DisCo introduces a staged training approach for Mixture-of-Experts models that significantly reduces training costs by decomposing the model into submodels trained on low-cost hardware, then integrating and fine-tuning on high-end GPUs.

Contribution

The paper presents a novel staged training framework for MoE models that enables cost-effective training on affordable hardware without sacrificing performance.

Findings

1. Achieves comparable or better performance than full training.
2. Reduces training costs by up to 69.5%.
3. Effective across multiple downstream tasks.

Abstract

Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. In contrast, affordable hardware is low-cost but constrained by memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. Subsequently, all experts are integrated into a complete MoE model and fine-tuned…
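
The staged pipeline described in the abstract (cluster the data, train one dense submodel per cluster with no communication, then assemble the experts into one MoE) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means on scalars stands in for it; each "expert" is a one-variable linear model fit by least squares; the shared backbone is reduced to the identity; the router simply picks the nearest centroid; and the final fine-tuning stage is omitted because the abstract is truncated before describing it. All function names are hypothetical.

```python
import random


def kmeans_1d(xs, k, iters=20, seed=0):
    """Plain k-means on scalars (assumed stand-in for the clustering step)."""
    rng = random.Random(seed)
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            buckets[nearest].append(x)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(b) / len(b) if b else c
                     for b, c in zip(buckets, centroids)]
    return centroids


def route(x, centroids):
    """Toy router (assumption): send x to the expert with the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def fit_expert(pairs):
    """Least-squares fit of y = w*x + b; the toy 'training' of one submodel."""
    if not pairs:
        return 0.0, 0.0
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    var = sum((x - mx) ** 2 for x, _ in pairs)
    w = sum((x - mx) * (y - my) for x, y in pairs) / var if var else 0.0
    return w, my - w * mx


def train_disco_toy(data, k):
    """Stage 1: partition the data. Stage 2: fit each expert on its own subset."""
    centroids = kmeans_1d([x for x, _ in data], k)
    subsets = [[] for _ in range(k)]
    for x, y in data:
        subsets[route(x, centroids)].append((x, y))
    # Each fit uses only its own subset, so the fits could run on separate
    # devices with no communication between them.
    experts = [fit_expert(s) for s in subsets]
    return centroids, experts


def moe_predict(x, centroids, experts):
    """Stage 3: the assembled model routes the input, then applies one expert."""
    w, b = experts[route(x, centroids)]
    return w * x + b


# Two regimes with different linear laws; each should get its own expert.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(float(x), -1.0 * x + 5.0) for x in range(100, 110)]
centroids, experts = train_disco_toy(data, k=2)
```

In this toy setting the two experts recover the two linear laws separately, which mirrors the idea that each submodel specializes on its own data cluster before integration. A real MoE would additionally need to reconcile the independently trained backbone copies and train the router, which this sketch does not attempt.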

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning