TL;DR
SonicMoE introduces IO-aware and tile-aware optimizations for MoE models that reduce the activation memory footprint and raise GPU training throughput.
Contribution
It presents a memory-efficient algorithm, GPU kernels overlapping IO with computation, and a novel token rounding method for improved MoE training efficiency.
Findings
Reduces activation memory by 45%.
Achieves 1.86x compute throughput improvement on Hopper GPUs.
Provides speedups of 25% on the forward pass and 15% on the backward pass on Blackwell GPUs.
Abstract
Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with a higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures.…
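The padding cost mentioned in the abstract can be made concrete with a small back-of-the-envelope sketch. It assumes a Grouped GEMM that processes each expert's tokens in row tiles of a fixed height, so every expert's token count is rounded up to a multiple of that height. The tile height of 128 and the helper names `padded_rows` and `wasted_fraction` are illustrative assumptions, not values or functions taken from the paper, and the sketch does not implement SonicMoE's token rounding method.

```python
import math

def padded_rows(tokens_per_expert, tile_m=128):
    """Rows computed per expert when each count is padded up to a multiple of tile_m.

    tile_m=128 is an assumed tile height, chosen only for illustration.
    """
    return [math.ceil(n / tile_m) * tile_m for n in tokens_per_expert]

def wasted_fraction(tokens_per_expert, tile_m=128):
    """Fraction of computed rows that are padding rather than real tokens."""
    real = sum(tokens_per_expert)
    padded = sum(padded_rows(tokens_per_expert, tile_m))
    return 0.0 if padded == 0 else (padded - real) / padded

# The same 1024 routed tokens, spread over few vs. many experts.
dense_routing = [256, 256, 256, 256]   # 4 experts, 256 tokens each
sparse_routing = [32] * 32             # 32 experts, 32 tokens each

print(wasted_fraction(dense_routing))   # 0.0
print(wasted_fraction(sparse_routing))  # 0.75
```

With the token count held fixed, spreading the tokens over more experts leaves each expert with a partially filled tile, so under these assumptions three quarters of the computed rows in the sparser case are padding.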