Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference

Jinghan Yao; Quentin Anthony; Aamir Shafi; Hari Subramoni; Dhabaleswar K. (DK) Panda

arXiv:2401.08383·cs.LG·January 18, 2024·1 cites

TL;DR

This paper introduces ExFlow, a lightweight optimization that exploits inter-layer expert affinity in pre-trained Mixture-of-Experts models to significantly reduce communication overhead and accelerate inference on distributed systems.

Contribution

The paper presents a context-coherent expert parallelism method that requires only one Alltoall communication, improving inference efficiency without fine-tuning or accuracy loss.

Findings

01 Reduces cross-GPU routing latency by up to 67%
02 Achieves up to 2.2x inference throughput improvement
03 Demonstrates implicit expert affinity in pre-trained GPT MoE models

Abstract

In large language models like the Generative Pre-trained Transformer, the Mixture of Experts paradigm has emerged as a powerful technique for enhancing model expressiveness and accuracy. However, deploying GPT MoE models for parallel inference on distributed systems presents significant challenges, primarily due to the extensive Alltoall communication required for expert routing and aggregation. This communication bottleneck exacerbates the already complex computational landscape, hindering the efficient utilization of high-performance computing resources. In this paper, we propose a lightweight optimization technique called ExFlow, to largely accelerate the inference of these MoE models. We take a new perspective on alleviating the communication overhead by exploiting the inter-layer expert affinity. Unlike previous methods, our solution can be directly applied to pre-trained MoE…
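The inter-layer expert affinity idea in the abstract can be illustrated with a small sketch. This is not ExFlow's implementation: the routing traces are made-up toy data, the helper names (`conditional_affinity`, `cross_gpu_hops`) and both placements are hypothetical, and the affinity-aware placement is chosen by hand rather than by any procedure from the paper. The sketch estimates how often a token that visits expert i at one layer visits expert j at the next, then counts how many layer-to-layer transitions cross a GPU boundary under two expert placements.

```python
from collections import Counter


def conditional_affinity(routes, layer, num_experts):
    """Return P[i][j] = P(expert j at layer+1 | expert i at layer).

    `routes` is a list of tuples; routes[t][l] is the expert chosen
    for token t at layer l (top-1 routing is assumed for simplicity).
    """
    counts = Counter((r[layer], r[layer + 1]) for r in routes)
    matrix = []
    for i in range(num_experts):
        row_total = sum(counts[(i, j)] for j in range(num_experts))
        matrix.append([
            counts[(i, j)] / row_total if row_total else 0.0
            for j in range(num_experts)
        ])
    return matrix


def cross_gpu_hops(routes, placement):
    """Count transitions between consecutive layers that change GPU.

    placement[l][e] is the GPU that hosts expert e of layer l.
    """
    hops = 0
    for r in routes:
        for layer in range(len(r) - 1):
            src_gpu = placement[layer][r[layer]]
            dst_gpu = placement[layer + 1][r[layer + 1]]
            if src_gpu != dst_gpu:
                hops += 1
    return hops


# Toy traces (assumed data): 40 tokens, 2 MoE layers, 4 experts per layer.
# Each layer-0 expert has one strongly preferred layer-1 successor.
routes = (
    [(0, 2)] * 9 + [(0, 1)] * 1
    + [(1, 3)] * 9 + [(1, 0)] * 1
    + [(2, 0)] * 9 + [(2, 3)] * 1
    + [(3, 1)] * 9 + [(3, 2)] * 1
)

# Naive placement: experts 0-1 on GPU 0 and experts 2-3 on GPU 1 at every layer.
naive = [[0, 0, 1, 1], [0, 0, 1, 1]]

# Hand-picked affinity-aware placement: layer-1 experts are co-located
# with the layer-0 experts that most often precede them.
affinity_aware = [[0, 0, 1, 1], [1, 1, 0, 0]]

if __name__ == "__main__":
    affinity = conditional_affinity(routes, layer=0, num_experts=4)
    for i, row in enumerate(affinity):
        print(f"layer-0 expert {i} -> layer-1 distribution: {row}")
    print("cross-GPU hops, naive placement:", cross_gpu_hops(routes, naive))
    print("cross-GPU hops, affinity-aware placement:",
          cross_gpu_hops(routes, affinity_aware))
```

On this toy trace, 36 of the 40 transitions cross GPUs under the naive placement, against 4 under the affinity-aware one. The gap comes entirely from how the synthetic data was constructed, so it says nothing about the speedups reported in the paper.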

Code & Models

Repositories

yjhmitweb/exflow (Official)

Taxonomy

Topics: Advanced Neural Network Applications · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Linear Layer · Dropout · Adam · Cosine Annealing · Dense Connections