FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Minghe Wang; Trever Schirmer; Mohammadreza Malekabbasi; David Bermbach

arXiv:2604.26881 · cs.DC · April 30, 2026

TL;DR

FaaSMoE introduces a serverless, multi-tenant MoE serving architecture that deploys experts as stateless functions, significantly reducing resource usage and enabling scalable, on-demand expert invocation.

Contribution

The paper presents a serverless framework for multi-tenant MoE deployment that decouples the control and execution planes, supports configurable expert granularity, and improves resource efficiency.

Findings

01

Uses less than one third of the resources of a full-model baseline.

02

Achieves scalable MoE serving with on-demand expert invocation.

03

Demonstrates effectiveness on multi-tenant workloads with an open-source prototype.

Abstract

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resources used by activated experts and the provisioned resources. This underutilization is further pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and…
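
The granularity trade-off described in the abstract can be illustrated with a minimal Python sketch. This is not the FaaSMoE implementation: the contiguous expert-to-function mapping and the names `function_for_expert`, `plan_invocations`, and `resident_experts` are hypothetical, chosen only to show how packing more experts into one function can reduce the number of invocations while loading more experts than a request actually needs.

```python
from collections import defaultdict


def function_for_expert(expert_id: int, experts_per_function: int) -> int:
    """Map an expert to the function hosting it (assumed contiguous grouping)."""
    if experts_per_function <= 0:
        raise ValueError("experts_per_function must be positive")
    return expert_id // experts_per_function


def plan_invocations(routed_experts, experts_per_function):
    """Group the experts selected by the router by their hosting function.

    Each key in the result corresponds to one function invocation.
    """
    plan = defaultdict(list)
    for expert_id in routed_experts:
        fn_id = function_for_expert(expert_id, experts_per_function)
        plan[fn_id].append(expert_id)
    return dict(plan)


def resident_experts(plan, experts_per_function):
    """Count experts held in memory by the invoked functions."""
    return len(plan) * experts_per_function


# Hypothetical routing decision: a token is routed to experts 1 and 2.
routed = [1, 2]

fine = plan_invocations(routed, experts_per_function=1)
coarse = plan_invocations(routed, experts_per_function=4)

print(len(fine), resident_experts(fine, 1))      # invocations, experts loaded
print(len(coarse), resident_experts(coarse, 4))  # invocations, experts loaded
```

Under these assumptions, one expert per function needs two invocations but loads only the two routed experts, while four experts per function needs a single invocation but keeps four experts resident. That is the elasticity versus invocation-overhead trade-off in miniature.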
