FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving
Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach

TL;DR
FaaSMoE is a serverless, multi-tenant MoE serving architecture that deploys experts as stateless functions, using less than one third of the resources of a full-model baseline while enabling scalable, on-demand expert invocation.
Contribution
The paper presents a serverless framework for multi-tenant MoE deployment that decouples the control plane from the execution plane and supports configurable expert granularity within functions, improving resource efficiency.
Findings
Uses less than one third of the resources of a full-model baseline.
Achieves scalable MoE serving through on-demand expert invocation.
Demonstrates effectiveness on multi-tenant workloads using a prototype built on an open-source FaaS platform.
Abstract
Mixture-of-Experts (MoE) models offer high capacity at an efficient inference cost by activating only a small subset of experts per input. However, deploying MoE models requires all experts to reside in memory, creating a gap between the resources used by activated experts and the resources provisioned. This underutilization is even more pronounced in multi-tenant scenarios. In this paper, we propose FaaSMoE, a multi-tenant MoE serving architecture built on Function-as-a-Service (FaaS) platforms. FaaSMoE decouples the control and execution planes of MoE by deploying experts as stateless FaaS functions, enabling on-demand and scale-to-zero expert invocation across tenants. FaaSMoE further supports configurable expert granularity within functions, trading off per-expert elasticity for reduced invocation overhead. We implement a prototype with an open-source edge-oriented FaaS platform and…
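The granularity trade-off described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`top_k_experts`, `plan_invocations`, `summarize`), the `experts_per_function` parameter, and the choice to pack experts into functions by contiguous index are all assumptions made for illustration. The control plane is modeled as plain top-k gating, and each "function" stands in for one stateless FaaS function hosting a group of experts.

```python
import math
from collections import defaultdict


def top_k_experts(gate_scores, k):
    """Control plane: pick the k highest-scoring experts for one token.

    Ties are broken in favor of the lower expert index.
    """
    order = sorted(range(len(gate_scores)), key=lambda i: (-gate_scores[i], i))
    return sorted(order[:k])


def function_for_expert(expert_id, experts_per_function):
    """Assumed packing: experts are grouped into functions by contiguous index."""
    return expert_id // experts_per_function


def plan_invocations(batch_gate_scores, k, experts_per_function):
    """Group token-to-expert assignments by the function hosting each expert.

    Returns {function_id: {expert_id: [token indices]}}. Only functions that
    appear in the plan need to be invoked; the rest can stay scaled to zero.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for token_idx, scores in enumerate(batch_gate_scores):
        for expert_id in top_k_experts(scores, k):
            fn = function_for_expert(expert_id, experts_per_function)
            plan[fn][expert_id].append(token_idx)
    return {fn: dict(experts) for fn, experts in plan.items()}


def summarize(plan, num_experts, experts_per_function):
    """Count invoked functions versus functions that received no tokens."""
    total_functions = math.ceil(num_experts / experts_per_function)
    return {
        "invocations": len(plan),
        "idle_functions": total_functions - len(plan),
    }


# Hypothetical gate scores: 3 tokens, 8 experts, top-2 routing.
scores = [
    [0.1, 0.9, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.9, 0.0, 0.0],
]

fine = summarize(plan_invocations(scores, 2, 1), 8, 1)
coarse = summarize(plan_invocations(scores, 2, 4), 8, 4)
```

With one expert per function, this batch triggers more invocations but leaves more functions idle. With four experts per function, it triggers fewer invocations, but every function must be running. This is the elasticity-versus-overhead trade-off in miniature.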
