Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism
Yuhao Qing, Guichao Zhu, Fanxin Li, Lintian Lei, Zekai Sun, Xiuxian Guan, Shixiong Zhao, Xusheng Chen, Dong Huang, Sen Wang, and Heming Cui

TL;DR
Hecate introduces Fully Sharded Sparse Data Parallelism (FSSDP) to improve MoE training efficiency by fully sharding parameters and using sparse collectives, significantly reducing stragglers and boosting training speed.
Contribution
The paper presents FSSDP, a novel parallelization approach for MoE models that enhances memory efficiency and reduces stragglers, implemented in the Hecate system.
Findings
Achieves up to 3.54x speedup over existing systems
Reduces memory and communication overhead
Demonstrates consistent improvements across architectures
Abstract
Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in terms of memory efficiency and timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an innovative approach that tackles the parallelization of MoE layers and potential straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes…
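To make the straggler effect described in the abstract concrete, here is a minimal sketch in Python. It assumes one expert per device under expert parallelism and uses token count as a proxy for per-device compute time. The routing data and the helper names `expert_loads` and `straggler_factor` are hypothetical illustrations, not part of Hecate or FSSDP.

```python
from collections import Counter


def expert_loads(assignments, num_experts):
    """Count how many tokens the router sent to each expert."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def straggler_factor(loads):
    """Ratio of the busiest device's load to the mean load.

    1.0 means perfectly balanced; larger values mean the other devices
    sit idle while waiting for the most heavily loaded one.
    """
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


# Hypothetical routing of 16 tokens to 4 experts (assumption: one expert per device).
assignments = [0] * 10 + [1] * 3 + [2] * 2 + [3] * 1
loads = expert_loads(assignments, num_experts=4)
factor = straggler_factor(loads)
print(loads, factor)
```

In this toy routing, the device hosting expert 0 receives 10 of the 16 tokens, so a synchronous step takes about 2.5 times as long as it would with a balanced load. This is the kind of imbalance the abstract attributes to expert parallelism.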
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis
