Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference
Danil Sivtsov, Aleksandr Katrutsa, Ivan Oseledets

TL;DR
This paper presents a topology-aware expert placement algorithm for MoE LLM inference that reduces network traffic by optimizing how experts are distributed across servers, improving efficiency, especially for large models.
Contribution
It introduces an ILP-based placement method that considers network topology to minimize communication during MoE inference, outperforming existing strategies.
Findings
ILP-based placement reduces network traffic.
Effective for both small and large-scale models.
Improves cluster utilization during inference.
Abstract
Efficient deployment of a pre-trained LLM on a multi-server cluster is a critical step for providing fast responses to users' queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently given their underlying structure. During inference in MoE LLMs, only a small subset of the experts is selected to process a given token. Moreover, in practice, the load across experts is highly imbalanced. For efficient deployment, the model must be distributed across a large number of servers using a model placement algorithm. Thus, to improve cluster utilization, the model placement algorithm has to take the network topology into account. This work focuses on efficient topology-aware placement of pre-trained MoE LLMs at the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of…
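To make the idea of topology-aware placement concrete, here is a minimal Python sketch of a toy version of the problem. It is not the paper's ILP formulation, which the truncated abstract does not show. It assumes a hypothetical `traffic` matrix (tokens exchanged between pairs of experts), a hypothetical `hops` matrix (network distance between servers), and a per-server `capacity`, and it finds the cheapest placement by exhaustive search rather than with an ILP solver, which is only feasible for very small instances.

```python
from itertools import product

def placement_cost(placement, traffic, hops):
    """Hop-weighted traffic: tokens exchanged between experts i and j,
    multiplied by the hop distance between the servers hosting them."""
    n = len(placement)
    return sum(
        traffic[i][j] * hops[placement[i]][placement[j]]
        for i in range(n)
        for j in range(n)
    )

def best_placement(traffic, hops, capacity):
    """Exhaustive search over expert-to-server assignments (toy sizes only).
    This stands in for an ILP solver; it is not the paper's method."""
    n_experts, n_servers = len(traffic), len(hops)
    best, best_cost = None, float("inf")
    for placement in product(range(n_servers), repeat=n_experts):
        # Skip placements that put too many experts on one server.
        if any(placement.count(s) > capacity for s in range(n_servers)):
            continue
        cost = placement_cost(placement, traffic, hops)
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost

# Hypothetical data: 4 experts, 2 servers, at most 2 experts per server.
# Experts 0 and 1 exchange many tokens, as do experts 2 and 3.
traffic = [
    [0, 9, 1, 0],
    [9, 0, 0, 1],
    [1, 0, 0, 8],
    [0, 1, 8, 0],
]
hops = [
    [0, 2],  # same server costs 0 hops, the other server costs 2
    [2, 0],
]

placement, cost = best_placement(traffic, hops, capacity=2)
round_robin_cost = placement_cost((0, 1, 0, 1), traffic, hops)
```

On this toy input the search co-locates the two heavily communicating expert pairs, so `cost` is far lower than `round_robin_cost` for a topology-agnostic round-robin assignment. A real deployment would replace the exhaustive loop with a solver, since the number of candidate placements grows exponentially with the number of experts.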
