MoEless: Efficient MoE LLM Serving via Serverless Computing

Hanfei Yu; Bei Ouyang; Shwai He; Ang Li; Hao Wang

arXiv:2603.06350·cs.DC·March 9, 2026

Open Access

TL;DR

MoEless introduces a serverless framework for efficient MoE large language model serving, reducing latency and cost by balancing expert loads and leveraging serverless computing.

Contribution

MoEless is the first system to address expert load imbalance in serverless MoE LLM serving, using lightweight load predictors and optimized expert scaling strategies.

Findings

01 Reduces inference latency by 43%
02 Cuts inference cost by 84%
03 Improves GPU utilization and load balancing

Abstract

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs as model scales advance, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless,…
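The load imbalance described in the abstract can be illustrated with a small sketch. This is not the paper's method or its load predictor; it only counts how many tokens a hypothetical top-2 router sends to each expert and reports the ratio of the busiest expert's load to the mean load. The routing table, the function names, and the assumption that each expert sits on its own device (so a layer finishes only when the busiest expert does) are all illustrative.

```python
from collections import Counter


def expert_loads(routing, num_experts):
    """Count how many tokens are routed to each expert.

    routing: one list of expert ids per token (the token's top-k choices).
    """
    counts = Counter(expert for token in routing for expert in token)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def imbalance_ratio(loads):
    """Return max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 0.0


# Hypothetical top-2 routing for 6 tokens over 4 experts (made-up data):
# every token picks expert 0, so expert 0 becomes the straggler.
routing = [[0, 1], [0, 2], [0, 1], [0, 3], [0, 1], [0, 2]]

loads = expert_loads(routing, num_experts=4)
ratio = imbalance_ratio(loads)

print("tokens per expert:", loads)
print("imbalance ratio (max / mean):", ratio)
```

Under the one-expert-per-device assumption, a ratio of 2.0 means the busiest expert processes twice the average number of tokens, so the layer takes roughly twice as long as it would with a perfectly even split while the lightly loaded experts sit idle.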

Taxonomy

Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications