Multi-Head LatentMoE and Head Parallel: Communication-Efficient and Deterministic MoE Parallelism

Chenwei Cui; Rockwell Jackson; Benjamin Joseph Herrera; Ana María Tárano; Hannah Kerner

arXiv:2602.04870 · cs.LG · February 5, 2026

TL;DR

This paper introduces Multi-Head LatentMoE, a Mixture of Experts (MoE) architecture, and Head Parallel (HP), a parallelism scheme for training it. Together they keep communication cost at $O(1)$ in the number of activated experts $k$, balance traffic, and train up to $1.61\times$ faster than MoE with Expert Parallel (EP) while maintaining performance.

Contribution

The paper proposes an architecture (Multi-Head LatentMoE) and a parallelism method (Head Parallel) that together achieve constant communication cost, completely balanced traffic, and deterministic communication, while remaining compatible with existing Expert Parallel techniques.

Findings

1. Achieves $O(1)$ communication cost regardless of the number of activated experts $k$

2. Trains up to $1.61\times$ faster than MoE with Expert Parallel (EP)

3. Maintains identical performance while improving efficiency

Abstract

Large language models have transformed many applications but remain expensive to train. Sparse Mixture of Experts (MoE) addresses this through conditional computation, with Expert Parallel (EP) as the standard distributed training method. However, EP has three limitations: communication cost grows linearly with the number of activated experts $k$, load imbalance affects latency and memory usage, and data-dependent communication requires metadata exchange. We propose Multi-Head LatentMoE and Head Parallel (HP), a new architecture and parallelism achieving $O(1)$ communication cost regardless of $k$, completely balanced traffic, and deterministic communication, all while remaining compatible with EP. To accelerate Multi-Head LatentMoE, we propose IO-aware routing and expert computation. Compared to MoE with EP, Multi-Head LatentMoE with HP trains up to $1.61\times$ faster while having…
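The EP limitations named in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's accounting: it assumes one expert per device, that EP sends a token's full hidden vector once per activated expert, and that a constant-cost scheme sends each token's data exactly once. The function names and the random routing used here are hypothetical stand-ins for a learned router.

```python
import random
from collections import Counter


def ep_dispatch_elements(num_tokens, d_model, k):
    """Elements sent in one dispatch if every token is copied to k experts (assumed EP model)."""
    return num_tokens * k * d_model


def constant_dispatch_elements(num_tokens, d_model):
    """Elements sent if every token's data is sent exactly once, independent of k (assumption)."""
    return num_tokens * d_model


def per_device_token_counts(num_tokens, num_devices, k, seed=0):
    """Simulate data-dependent top-k routing with one expert per device.

    Each token picks k distinct devices at random (a stand-in for a learned router).
    The resulting counts differ per device and per batch, which is why receivers
    would need the counts (metadata) before buffers can be sized.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_tokens):
        for device in rng.sample(range(num_devices), k):
            counts[device] += 1
    return [counts[device] for device in range(num_devices)]


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        ep = ep_dispatch_elements(4096, 1024, k)
        const = constant_dispatch_elements(4096, 1024)
        print(f"k={k}: EP={ep} elements, constant-cost={const} elements")
    print("tokens received per device:", per_device_token_counts(4096, 8, 2))
```

Under these assumptions the EP volume scales with $k$ while the constant-cost volume does not, and the per-device counts from random routing are uneven, which mirrors the load-imbalance and metadata-exchange points above.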

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks