MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization
Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li

TL;DR
This paper introduces MoNTA, a network-traffic-aware parallel optimization method for Mixture-of-Experts models. It improves communication efficiency and reduces training latency by choosing parallel strategies according to network topology and communication volume.
Contribution
MoNTA is the first approach to optimize parallel training strategies for MoE models considering network traffic and topology, significantly enhancing communication performance.
Findings
8x increase in AllToAll communication performance over DeepSpeed under 8-card tensor parallelism
13% reduction in overall training latency for large models
Effective adaptation to network topology improves training efficiency
Abstract
The Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models from various domains into a single supermodel. This approach enables the model to scale without significantly increasing the computational costs of training and inference, while maximizing model performance. However, current distributed training frameworks do not fully optimize communication, especially for large base models. This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume and the training cluster's inter-node and intra-node network topologies. Compared to DeepSpeed, MoNTA achieves an 8x increase in AllToAll communication performance under 8-card tensor parallelism. Compared to the baseline, training a 2x70B model using 16 A800…
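The idea of selecting a parallel strategy from communication volume and intra-node versus inter-node links can be illustrated with a toy cost model. The sketch below is not the paper's algorithm: the `Link` class, the latency-plus-bandwidth formula, the strategy names, and all numbers are hypothetical and serve only to show how a traffic-aware choice might be made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link model: fixed latency plus bytes / bandwidth."""
    latency_s: float
    bandwidth_bytes_per_s: float

    def transfer_time(self, num_bytes: float) -> float:
        # No traffic on this link means no cost at all.
        if num_bytes <= 0:
            return 0.0
        return self.latency_s + num_bytes / self.bandwidth_bytes_per_s


def estimate_comm_time(intra_bytes, inter_bytes, intra, inter):
    """Sum the estimated time spent on intra-node and inter-node links."""
    return intra.transfer_time(intra_bytes) + inter.transfer_time(inter_bytes)


def pick_strategy(candidates, intra, inter):
    """Return the cheapest candidate and the full cost table.

    `candidates` maps a strategy name to (intra_bytes, inter_bytes),
    i.e. how much traffic that strategy puts on each kind of link.
    """
    costs = {
        name: estimate_comm_time(intra_bytes, inter_bytes, intra, inter)
        for name, (intra_bytes, inter_bytes) in candidates.items()
    }
    best = min(costs, key=costs.get)
    return best, costs


# Illustrative numbers only (assumed, not taken from the paper).
intra_link = Link(latency_s=5e-6, bandwidth_bytes_per_s=200e9)
inter_link = Link(latency_s=20e-6, bandwidth_bytes_per_s=25e9)

candidates = {
    "all_inter": (0.0, 1e9),       # all traffic crosses nodes
    "mostly_intra": (8e8, 2e8),    # most traffic stays inside a node
}

best, costs = pick_strategy(candidates, intra_link, inter_link)
print(best, costs)
```

In this toy setting the strategy that keeps most traffic on the faster intra-node link wins; a real system would need measured bandwidths and a model of overlap between communication and computation.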
Taxonomy
Topics: IoT and Edge/Fog Computing · Energy Efficient Wireless Sensor Networks · Robotics and Automated Systems
Methods: Balanced Selection
