HAP: Hybrid Adaptive Parallelism for Efficient Mixture-of-Experts Inference
Haoran Lin, Xianzhi Yu, Kang Zhao, Han Bao, Zongyuan Zhan, Ting Hu, Wulong Liu, Zekun Yin, Xin Li, Weiguo Liu

TL;DR
HAP is a dynamic hybrid parallelism approach for Mixture-of-Experts (MoE) inference. It uses Integer Linear Programming (ILP) to select parallel strategies adaptively for each model and hardware configuration, with reported speedups of up to 1.77x over the TP strategy on GPUs.
Contribution
This work presents HAP, a novel adaptive parallelism method that hierarchically decomposes MoE models and employs ILP to optimize inference efficiency across diverse scenarios.
Findings
Achieves up to a 1.77x speedup over the TP strategy on GPUs.
Maintains high performance across different MoE models.
Demonstrates effective generalization to various hardware and model configurations.
Abstract
Current inference systems for Mixture-of-Experts (MoE) models primarily employ static parallelization strategies. However, these static approaches cannot consistently achieve optimal performance across different inference scenarios, as they lack the flexibility to adapt to varying computational requirements. In this work, we propose HAP (Hybrid Adaptive Parallelism), a novel method that dynamically selects hybrid parallel strategies to enhance MoE inference efficiency. The fundamental innovation of HAP lies in hierarchically decomposing MoE architectures into two distinct computational modules: the Attention module and the Expert module, each augmented with a specialized inference latency simulation model. This decomposition promotes the construction of a comprehensive search space for seeking model parallel strategies. By leveraging Integer Linear Programming (ILP), HAP could solve the…
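The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the strategy names other than TP, the latency numbers, and the transition costs are all hypothetical, and a plain exhaustive search stands in for the ILP solver. The sketch only shows the shape of the problem: pick one strategy per module so that the simulated module latencies plus the cost of switching layouts between modules is minimal.

```python
from itertools import product

# Hypothetical simulated latencies in milliseconds (illustrative numbers only,
# not taken from the paper). "DP" and "EP" are assumed strategy names.
ATTENTION_LATENCY = {"TP": 4.0, "DP": 3.0}
EXPERT_LATENCY = {"TP": 6.0, "EP": 4.5}

# Hypothetical cost of re-laying-out activations between the two modules,
# keyed by (attention_strategy, expert_strategy).
TRANSITION_LATENCY = {
    ("TP", "TP"): 0.0,
    ("TP", "EP"): 1.0,
    ("DP", "TP"): 1.5,
    ("DP", "EP"): 1.5,
}


def select_strategies(attn_cost, expert_cost, transition_cost):
    """Return (attention_strategy, expert_strategy, total_latency).

    Enumerates every pair of strategies and keeps the cheapest one.
    An ILP solver would instead use one binary variable per
    (module, strategy) choice with a "choose exactly one" constraint.
    """
    best = None
    for attn, expert in product(attn_cost, expert_cost):
        total = attn_cost[attn] + expert_cost[expert] + transition_cost[(attn, expert)]
        if best is None or total < best[2]:
            best = (attn, expert, total)
    return best


best_attn, best_expert, best_latency = select_strategies(
    ATTENTION_LATENCY, EXPERT_LATENCY, TRANSITION_LATENCY
)
print(f"Attention: {best_attn}, Expert: {best_expert}, latency: {best_latency} ms")
```

With these made-up numbers, the static all-TP choice costs 10.0 ms while the best hybrid choice costs 9.0 ms, which is the kind of gap an adaptive selector is meant to find. Exhaustive search is only practical for a search space this small; a real search space would call for an actual ILP solver.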
