Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism
Chenwei Cui, Rockwell Jackson, Benjamin Joseph Herrera, Ana María Tárano, Hannah Kerner

TL;DR
This paper introduces Multi-Head LatentMoE, a sparse MoE architecture, and Head Parallel, a matching parallelism method for large language models. Together they keep communication cost constant in the number of activated experts, balance traffic, and train up to 1.61 times faster than EP-based MoE while maintaining performance.
Contribution
The paper proposes a new architecture and parallelism method that achieves constant communication cost, balanced traffic, and deterministic communication while remaining compatible with Expert Parallel (EP).
Findings
Achieves $O(1)$ communication cost regardless of the number of activated experts
Trains up to 1.61 times faster than traditional EP-based MoE
Maintains identical performance with improved efficiency
Abstract
Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of the number of activated experts, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to 1.61 times faster while having…
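To make the scaling claim concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the paper's cost analysis: the function names, the token and hidden sizes, and the assumption that a constant-cost scheme sends each token's hidden vector exactly once are all illustrative. It only contrasts a dispatch volume that grows linearly with the number of activated experts against one that does not.

```python
def ep_dispatch_bytes(num_tokens, hidden_dim, activated_experts, bytes_per_elem=2):
    """Simplified EP model (assumption): each token's hidden vector is
    sent once per activated expert, so volume grows linearly with it."""
    return num_tokens * hidden_dim * activated_experts * bytes_per_elem


def constant_dispatch_bytes(num_tokens, hidden_dim, bytes_per_elem=2):
    """Simplified O(1) model (assumption): each token's hidden vector is
    sent once in total, independent of how many experts are activated."""
    return num_tokens * hidden_dim * bytes_per_elem


if __name__ == "__main__":
    tokens, hidden = 4096, 1024  # hypothetical sizes, 2 bytes per element
    baseline = constant_dispatch_bytes(tokens, hidden)
    for activated in (1, 2, 4, 8):
        ep = ep_dispatch_bytes(tokens, hidden, activated)
        # The ratio equals the number of activated experts in this toy model.
        print(activated, ep, baseline, ep // baseline)
```

Under these assumptions the gap between the two volumes equals the number of activated experts. Real systems also pay for load imbalance and metadata exchange, which this toy model leaves out.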
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks
