TL;DR
This paper introduces SPES, a memory-efficient decentralized framework for pretraining large mixture-of-experts language models across distributed GPUs, reducing memory use and communication costs while maintaining competitive performance.
Contribution
The authors propose a decentralized training method for MoE LLMs in which each node trains only a subset of the experts, together with an expert-merging warm-up strategy. This allows large models to be trained with less memory per node and less communication between nodes.
Findings
Trained a 2B-parameter MoE LLM on 16 GPUs with competitive results.
Scaled to 7B and 9B models that match centralized baselines.
Achieved training over internet connections with reduced memory footprint.
Abstract
Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to…
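The synchronization idea described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the round-robin expert assignment, the averaging of shared (non-expert) weights, the rule that each expert is copied from its owning node, and all names (`assign_experts`, `local_step`, `synchronize`, `run`) are hypothetical choices made for illustration. Each "parameter group" is a single float, and the gradients are fake constants. Nodes keep a frozen copy of non-local experts here; whether SPES does so is also an assumption.

```python
from typing import Dict, List, Tuple

Params = Dict[str, float]  # toy model: one float stands in for each parameter group


def assign_experts(num_experts: int, num_nodes: int) -> List[List[int]]:
    """Round-robin partition of expert ids across nodes (assumed policy)."""
    return [list(range(n, num_experts, num_nodes)) for n in range(num_nodes)]


def local_step(params: Params, owned: List[int], node_id: int, lr: float = 0.1) -> None:
    """Stand-in for one local training step.

    Only the shared weights and the experts owned by this node are updated;
    all other experts stay frozen, so they need no gradients or optimizer state.
    """
    params["shared"] -= lr * (node_id + 1)  # fake gradient that differs per node
    for e in owned:
        params[f"expert_{e}"] -= lr * 1.0   # fake gradient for a local expert


def synchronize(nodes: List[Params], ownership: List[List[int]]) -> int:
    """Average shared weights and copy each expert from its owner (assumed rule).

    Returns the largest number of parameter groups any single node uploads.
    """
    shared_avg = sum(p["shared"] for p in nodes) / len(nodes)
    fresh_experts = {}
    for node_id, owned in enumerate(ownership):
        for e in owned:
            fresh_experts[f"expert_{e}"] = nodes[node_id][f"expert_{e}"]
    for p in nodes:
        p["shared"] = shared_avg
        p.update(fresh_experts)
    return 1 + max(len(owned) for owned in ownership)


def run(num_experts: int = 8, num_nodes: int = 4, rounds: int = 3,
        local_steps: int = 5) -> Tuple[List[Params], List[List[int]], int]:
    """Alternate local training with periodic synchronization."""
    ownership = assign_experts(num_experts, num_nodes)
    init = {"shared": 0.0, **{f"expert_{e}": 0.0 for e in range(num_experts)}}
    nodes = [dict(init) for _ in range(num_nodes)]
    uploaded = 0
    for _ in range(rounds):
        for node_id, params in enumerate(nodes):
            for _ in range(local_steps):
                local_step(params, ownership[node_id], node_id)
        uploaded = synchronize(nodes, ownership)
    return nodes, ownership, uploaded


if __name__ == "__main__":
    nodes, ownership, uploaded = run()
    total_groups = len(nodes[0])
    print(f"each node uploads at most {uploaded} of {total_groups} parameter groups")
```

In this toy setup a node uploads only its shared weights plus its own experts at each synchronization, rather than every parameter group, which is the kind of saving the abstract attributes to avoiding full-parameter transmission.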
