Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda

TL;DR
This paper introduces ExFlow, a lightweight optimization that exploits inter-layer expert affinity in pre-trained Mixture-of-Experts models to significantly reduce communication overhead and accelerate inference on distributed systems.
Contribution
It presents a novel context-coherent expert parallelism method that requires only one Alltoall communication, improving efficiency without fine-tuning or accuracy loss.
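To make the communication saving concrete, the following minimal sketch counts how often a token must move between GPUs when it passes from one MoE layer to the next under a given expert placement. It is an illustration only, not the paper's algorithm: the function name `cross_gpu_fraction`, the routing traces, and both placements are hypothetical, and top-1 routing is assumed.

```python
def cross_gpu_fraction(routes, placement):
    """Fraction of layer-to-layer token transitions that cross GPUs.

    routes: one list per token, holding the expert chosen at each MoE layer
            (top-1 routing is assumed for simplicity).
    placement: placement[layer][expert] is the GPU id hosting that expert.
    """
    crossings = 0
    transitions = 0
    for token_route in routes:
        for layer in range(len(token_route) - 1):
            src_gpu = placement[layer][token_route[layer]]
            dst_gpu = placement[layer + 1][token_route[layer + 1]]
            if src_gpu != dst_gpu:
                crossings += 1
            transitions += 1
    return crossings / transitions if transitions else 0.0


# Hypothetical traces: 7 tokens, 2 MoE layers, 4 experts per layer, 2 GPUs.
routes = [[0, 2], [0, 2], [1, 3], [2, 0], [3, 1], [2, 1], [0, 1]]

# Same expert-to-GPU layout in every layer (experts 0,1 on GPU 0; 2,3 on GPU 1).
naive = [[0, 0, 1, 1], [0, 0, 1, 1]]

# Layer-1 experts moved so that frequently consecutive experts share a GPU.
affinity_aware = [[0, 0, 1, 1], [1, 1, 0, 0]]

naive_fraction = cross_gpu_fraction(routes, naive)
aware_fraction = cross_gpu_fraction(routes, affinity_aware)
print(f"naive: {naive_fraction:.2f}, affinity-aware: {aware_fraction:.2f}")
```

In this toy trace, the second placement keeps most tokens on the GPU they already occupy, which is the kind of effect an affinity-based placement aims for.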
Findings
Reduces cross-GPU routing latency by up to 67%
Achieves up to 2.2x inference throughput improvement
Demonstrates implicit expert affinity in pre-trained GPT MoE models
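One way to picture inter-layer expert affinity is as a conditional probability: given that a token was routed to expert i at one layer, how likely is it to be routed to expert j at the next layer? The sketch below estimates that matrix from routing traces. It rests on assumptions not taken from the paper: the function name `affinity_matrix`, the top-1 routing, and the trace values are all hypothetical.

```python
def affinity_matrix(routes_l, routes_next, num_experts):
    """Estimate P(expert j at layer l+1 | expert i at layer l).

    routes_l, routes_next: per-token expert ids at two consecutive MoE
    layers, aligned by token position (top-1 routing assumed).
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for i, j in zip(routes_l, routes_next):
        counts[i][j] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for experts that received no tokens are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical traces for 8 tokens and 4 experts per layer.
routes_l = [0, 0, 0, 1, 1, 2, 2, 3]
routes_next = [1, 1, 2, 0, 0, 3, 3, 2]

aff = affinity_matrix(routes_l, routes_next, num_experts=4)
for i, row in enumerate(aff):
    print(i, [round(p, 2) for p in row])
```

A row concentrated on a few entries indicates strong affinity between that expert and specific experts in the next layer; a near-uniform row indicates little affinity to exploit.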
Abstract
In large language models like the Generative Pre-trained Transformer, the Mixture of Experts paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck exacerbates the already complex computational landscape, hindering the efficient utilization of high-performance computing resources. In this paper, we propose a lightweight optimization technique called ExFlow, to largely accelerate the inference of these MoE models. We take a new perspective on alleviating the communication overhead by exploiting the inter-layer expert affinity. Unlike previous methods, our solution can be directly applied to pre-trained MoE…
Taxonomy
Topics: Advanced Neural Network Applications · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Linear Layer · Dropout · Adam · Cosine Annealing · Dense Connections
