Static Batching of Irregular Workloads on GPUs: Framework and Application to Efficient MoE Model Inference

Yinghan Li; Yifei Li; Jiejing Zhang; Bujiao Chen; Xiaotong Chen; Lian Duan; Yejun Jin; Zheng Li; Xuanyu Liu; Haoyu Wang; Wente Wang; Yajie Wang; Jiacheng Yang; Peiyang Zhang; Laiwen Zheng; Wenyuan Yu

arXiv:2501.16103·cs.DC·January 28, 2025

TL;DR

This paper introduces a framework that statically batches irregular GPU workloads into a single kernel and applies it to MoE model inference, reaching up to 91% of peak Tensor Core throughput on the NVIDIA H800 and 95% on the NVIDIA H20.

Contribution

It presents a novel static batching framework with runtime task mapping for irregular workloads and applies it to optimize MoE model inference on GPUs.

Findings

01

Achieves up to 91% of peak Tensor Core throughput on the NVIDIA H800 GPU.

02

Achieves up to 95% of peak Tensor Core throughput on the NVIDIA H20 GPU.

03

Demonstrates significant efficiency improvements in MoE inference.

Abstract

It has long been a problem to arrange and execute irregular workloads on massively parallel devices. We propose a general framework for statically batching irregular workloads into a single kernel with a runtime task mapping mechanism on GPUs. We further apply this framework to Mixture-of-Experts (MoE) model inference and implement an optimized and efficient CUDA kernel. Our MoE kernel achieves up to 91% of the peak Tensor Core throughput on the NVIDIA H800 GPU and 95% on the NVIDIA H20 GPU.
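
The abstract does not spell out how the runtime task mapping works. As a rough illustration of the general idea only, the Python sketch below shows one common way a single batched launch could decide which task each thread block serves: take a prefix sum over the number of tiles each task needs, then binary-search a flat block index into that array. The function names, the 64-row tile size, and the per-expert row counts are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from itertools import accumulate


def build_tile_offsets(rows_per_task, tile_rows):
    """Exclusive prefix sum of the number of row tiles each task needs."""
    tiles = [(rows + tile_rows - 1) // tile_rows for rows in rows_per_task]
    return [0] + list(accumulate(tiles))


def map_block(block_id, offsets):
    """Map a flat block index to (task_id, tile_index_within_task)."""
    if not 0 <= block_id < offsets[-1]:
        raise IndexError("block_id outside the batched grid")
    # The last offset <= block_id identifies the owning task; tasks with
    # zero tiles are skipped automatically because their offsets repeat.
    task = bisect_right(offsets, block_id) - 1
    return task, block_id - offsets[task]


# Hypothetical MoE-style workload: tokens routed to four experts.
rows_per_expert = [5, 0, 130, 64]
offsets = build_tile_offsets(rows_per_expert, tile_rows=64)

# One entry per thread block in the single batched launch.
mapping = [map_block(b, offsets) for b in range(offsets[-1])]
print(offsets)   # [0, 1, 1, 4, 5]
print(mapping)   # [(0, 0), (2, 0), (2, 1), (2, 2), (3, 0)]
```

In this sketch the expert that received no tokens occupies no blocks, and the grid size equals the total tile count, which is the property that makes one launch sufficient for tasks of uneven size. A GPU kernel would perform the equivalent lookup per thread block on the device.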

Taxonomy

Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management

Methods: Mixture of Experts