MoEless: Efficient MoE LLM Serving via Serverless Computing
Hanfei Yu, Bei Ouyang, Shwai He, Ang Li, Hao Wang

TL;DR
MoEless is a serverless framework for serving MoE large language models. It reduces inference latency and cost by balancing expert loads and scaling experts elastically on serverless infrastructure.
Contribution
MoEless is the first framework to address expert load imbalance in serverless MoE LLM serving, using lightweight load predictors and optimized expert scaling strategies.
Findings
Reduces inference latency by 43%
Cuts inference cost by 84%
Improves GPU utilization and load balancing
Abstract
Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs as model scales advance, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, which limits expert scalability and elasticity and results in either costly real-time expert swapping or degraded generation quality. We present MoEless,…
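The load-imbalance effect described in the abstract can be illustrated with a small simulation. The sketch below is not MoEless's method or its load predictor. It is a toy model that assumes a Zipf-like expert popularity in place of a learned gating network, and the names `route_tokens`, `straggler_ratio`, and the `skew` parameter are hypothetical. It shows only why a few popular experts become stragglers when each expert sits on its own device under expert parallelism.

```python
import random

def route_tokens(num_tokens, num_experts, top_k, skew, seed=0):
    """Toy top-k routing: each token picks top_k distinct experts.

    Assumption: expert popularity follows a Zipf-like weight 1/(rank+1)**skew.
    A real MoE layer uses a learned gating network instead.
    Returns the number of tokens assigned to each expert.
    """
    rng = random.Random(seed)
    experts = list(range(num_experts))
    weights = [1.0 / (rank + 1) ** skew for rank in experts]
    loads = [0] * num_experts
    for _ in range(num_tokens):
        chosen = set()
        # Resample until the token has top_k distinct experts.
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=weights)[0])
        for expert in chosen:
            loads[expert] += 1
    return loads

def straggler_ratio(loads):
    """Busiest expert's load divided by the mean load (1.0 = perfectly balanced)."""
    mean_load = sum(loads) / len(loads)
    return max(loads) / mean_load

if __name__ == "__main__":
    balanced = route_tokens(num_tokens=4000, num_experts=8, top_k=2, skew=0.0)
    skewed = route_tokens(num_tokens=4000, num_experts=8, top_k=2, skew=1.2)
    print("balanced loads:", balanced, "ratio:", round(straggler_ratio(balanced), 2))
    print("skewed loads:  ", skewed, "ratio:", round(straggler_ratio(skewed), 2))
```

If every expert runs on its own device and a layer finishes only when its busiest expert finishes, the ratio approximates how much the straggler stretches layer latency relative to a balanced assignment.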
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications
