Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism

Yuhao Qing, Guichao Zhu, Fanxin Li, Lintian Lei, Zekai Sun, Xiuxian Guan, Shixiong Zhao, Xusheng Chen, Dong Huang, Sen Wang, and Heming Cui

arXiv:2502.02581·cs.DC·February 5, 2025

TL;DR

Hecate introduces Fully Sharded Sparse Data Parallelism (FSSDP) to improve MoE training efficiency by fully sharding parameters and using sparse collectives, significantly reducing stragglers and boosting training speed.

Contribution

The paper presents FSSDP, a novel parallelization approach for MoE models that enhances memory efficiency and reduces stragglers, implemented in the Hecate system.

Findings

1. Achieves up to 3.54x speedup over existing systems
2. Reduces memory and communication overhead
3. Demonstrates consistent improvements across architectures

Abstract

Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in terms of memory efficiency and timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an innovative approach that tackles the parallelization of MoE layers and potential straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes…
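To make the straggler effect mentioned in the abstract concrete, here is a minimal sketch of how imbalanced expert loads translate into imbalanced device loads under expert parallelism. It is an illustration only, not Hecate's implementation: the function names `device_loads` and `straggler_ratio`, the contiguous expert-to-device placement, and the example routing are all assumptions made for this sketch.

```python
from collections import Counter


def device_loads(expert_ids, num_experts, num_devices):
    """Count tokens per device, assuming experts are placed contiguously
    (hypothetical layout: device d hosts experts d*k .. d*k + k - 1)."""
    if num_experts % num_devices != 0:
        raise ValueError("num_experts must be divisible by num_devices")
    experts_per_device = num_experts // num_devices
    loads = [0] * num_devices
    for expert, tokens in Counter(expert_ids).items():
        if not 0 <= expert < num_experts:
            raise ValueError(f"expert id {expert} out of range")
        loads[expert // experts_per_device] += tokens
    return loads


def straggler_ratio(loads):
    """Return max load divided by mean load; 1.0 means perfectly balanced."""
    total = sum(loads)
    if total == 0:
        raise ValueError("no tokens were routed")
    return max(loads) / (total / len(loads))


# Skewed routing (made-up numbers): expert 0 receives most of the tokens.
skewed = [0] * 70 + [1] * 10 + [2] * 10 + [3] * 10
skewed_loads = device_loads(skewed, num_experts=4, num_devices=2)

# Balanced routing for comparison: every expert receives 25 tokens.
balanced = [0, 1, 2, 3] * 25
balanced_loads = device_loads(balanced, num_experts=4, num_devices=2)

print(skewed_loads, straggler_ratio(skewed_loads))
print(balanced_loads, straggler_ratio(balanced_loads))
```

In the skewed case the device hosting the popular expert processes far more tokens than the other, so every other device waits for it at the next synchronization point. A ratio above 1.0 is a rough proxy for that waiting time.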

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis