Scalable Training of Mixture-of-Experts Models with Megatron Core

Zijie Yan; Hongxiao Bai; Xin Yao; Dennis Liu; Tong Liu; Hongbin Liu; Pingtian Li; Evan Wu; Shiqing Fan; Li Tao; Robin Zhang; Yuzhong Wang; Shifang Xu; Jack Chang; Xuwen Chen; Kunlun Li; Yan Bai; Gao Deng; Nan Zheng; Vijay Anand Korthikanti; Abhinav Khattar; Ethan He; Soham Govande; Sangkug Lym; Zhongbo Zhu; Qi Zhang; Haochen Yuan; Xiaowei Ren; Deyu Fu; Tailai Ma; Shunkang Zhang; Jiang Shao; Ray Wang; Vasudevan Rengasamy; Rachit Garg; Santosh Bhavani; Xipeng Li; Chandler Zhou; David Wu; Yingcan Wei; Ashwath Aithal; Michael Andersch; Mohammad Shoeybi; Jiajie Yao; June Yang (NVIDIA)

arXiv:2603.07685·cs.DC·March 11, 2026

Open Access

TL;DR

This paper presents a comprehensive system for scalable training of Mixture-of-Experts (MoE) models with Megatron Core. It addresses the coupled memory, communication, and computation constraints of MoE training so that models with trillions of parameters can be trained efficiently.

Contribution

The paper introduces integrated system optimizations and a flexible parallelism framework (Parallel Folding) for efficient, scalable MoE training on large GPU clusters, together with an open-source implementation.

Findings

01. Achieves over 1,200 TFLOPS/GPU on large models
02. Supports low-precision training (FP8, NVFP4)
03. Enables training of models with trillions of parameters

Abstract

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation, creating coupled constraints across memory, communication, and computation. Optimizing one dimension often shifts pressure to another, demanding co-design across the full system stack. We address these challenges for MoE training through integrated optimizations spanning memory (fine-grained recomputation, offloading, etc.), communication (optimized dispatchers, overlapping, etc.), and computation (Grouped GEMM, fusions, CUDA Graphs, etc.). The framework also provides Parallel Folding for flexible multi-dimensional parallelism, low-precision training support for FP8 and NVFP4, and efficient long-context training. On NVIDIA GB300 and GB200, it achieves…
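
The abstract's point that sparsity lets total parameters grow much faster than per-token computation can be illustrated with a little arithmetic. The sketch below is not taken from the paper or from Megatron Core: the `MoELayer` class, its field names, and the simplified two-matrix expert (up and down projection, no biases, no gating matrix) are assumptions chosen only to keep the counting easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoELayer:
    """Hypothetical, simplified MoE feed-forward layer used only for counting."""
    hidden: int       # model (hidden) dimension
    ffn_hidden: int   # inner dimension of each expert
    num_experts: int  # experts stored in the layer
    top_k: int        # experts activated per token

    def router_params(self) -> int:
        # One linear map from the hidden state to a score per expert.
        return self.hidden * self.num_experts

    def expert_params(self) -> int:
        # Assumption: each expert is an up projection plus a down projection.
        return 2 * self.hidden * self.ffn_hidden

    def total_params(self) -> int:
        # Every expert must be held in memory, whether or not a token uses it.
        return self.router_params() + self.num_experts * self.expert_params()

    def active_params(self) -> int:
        # A token only touches the router and its top_k selected experts.
        return self.router_params() + self.top_k * self.expert_params()

    def forward_flops_per_token(self) -> int:
        # Roughly 2 FLOPs (multiply and add) per active weight in the forward pass.
        return 2 * self.active_params()


def sparsity_table(hidden, ffn_hidden, top_k, expert_counts):
    """Return (num_experts, total_params, active_params) for each expert count."""
    rows = []
    for n in expert_counts:
        layer = MoELayer(hidden, ffn_hidden, n, top_k)
        rows.append((n, layer.total_params(), layer.active_params()))
    return rows


if __name__ == "__main__":
    # Illustrative sizes only; they do not correspond to any model in the paper.
    for n, total, active in sparsity_table(4096, 2048, 8, [8, 64, 256]):
        print(f"experts={n:4d}  total={total:,}  active per token={active:,}")
```

Under these assumptions, raising the expert count while holding `top_k` fixed scales the stored parameters almost linearly, whereas the per-token active parameters change only through the small router term. That gap is why memory and communication, rather than raw compute, tend to become the binding constraints.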

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques