Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference
Yinghan Li, Yifei Li, Jiejing Zhang, Bujiao Chen, Xiaotong Chen, Lian Duan, Yejun Jin, Zheng Li, Xuanyu Liu, Haoyu Wang, Wente Wang, Yajie Wang, Jiacheng Yang, Peiyang Zhang, Laiwen Zheng, Wenyuan Yu

TL;DR
This paper introduces a static batching framework for irregular GPU workloads and applies it to MoE model inference, where the resulting CUDA kernel reaches up to 91% of peak Tensor Core throughput on an NVIDIA H800 GPU and 95% on an NVIDIA H20 GPU.
Contribution
It presents a novel static batching framework with runtime task mapping for irregular workloads and applies it to optimize MoE model inference on GPUs.
Findings
The MoE kernel achieves up to 91% of peak Tensor Core throughput on an NVIDIA H800 GPU.
The MoE kernel achieves up to 95% of peak Tensor Core throughput on an NVIDIA H20 GPU.
Both results come from a single statically batched CUDA kernel that uses runtime task mapping for MoE inference.
Abstract
It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on NVIDIA H800 GPU and 95% on NVIDIA H20 GPU.
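The abstract does not spell out how the runtime task mapping works, so the following is only a minimal sketch of one common way such a mapping can be done, not the authors' implementation. It assumes each expert's token batch is cut into fixed-height tiles, that all tiles from all experts are launched as one flat grid, and that each thread block recovers its (expert, tile) pair from a prefix sum of per-expert tile counts. The names `build_tile_offsets`, `map_block_to_task`, and `tile_m` are hypothetical, and plain Python stands in for what a CUDA kernel would do per thread block.

```python
from bisect import bisect_right
from itertools import accumulate


def build_tile_offsets(tokens_per_expert, tile_m):
    """Return the exclusive prefix sum of per-expert tile counts.

    offsets[e] is the flat index of expert e's first tile and
    offsets[-1] is the total number of tiles (the grid size).
    This layout is an assumption made for illustration.
    """
    tiles = [(n + tile_m - 1) // tile_m for n in tokens_per_expert]
    return [0] + list(accumulate(tiles))


def map_block_to_task(block_id, offsets):
    """Map a flat block index to (expert, tile index within that expert).

    bisect_right picks the last expert whose first tile is <= block_id,
    which also skips experts that received no tokens.
    Requires 0 <= block_id < offsets[-1].
    """
    expert = bisect_right(offsets, block_id) - 1
    return expert, block_id - offsets[expert]


if __name__ == "__main__":
    # Irregular workload: four experts with very different token counts.
    offsets = build_tile_offsets([5, 0, 130, 64], tile_m=64)
    for block_id in range(offsets[-1]):
        print(block_id, map_block_to_task(block_id, offsets))
```

In this sketch the grid size depends only on the total tile count, so experts with few or no tokens do not leave thread blocks idle; whether the paper uses a prefix-sum lookup like this one or a different mapping scheme is not stated in the text above.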
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
Methods: Mixture of Experts
