Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing
Wentao Liu, Yuhao Hu, Ruiting Zhou, Baochun Li, Ne Wang

TL;DR
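Remoe is a system that makes Mixture-of-Experts (MoE) inference for large language models more efficient and cost-effective in serverless environments. It runs non-expert modules on GPUs and experts on CPUs, offloads infrequently activated experts to separate serverless functions, and optimizes memory use and parallel execution.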
Contribution
The paper introduces Remoe, a heterogeneous MoE inference system with novel algorithms for expert activation prediction, memory management, and parallelization, all tailored for serverless computing.
Findings
Reduces inference cost by up to 57%
Cuts cold start latency by 47%
Achieves efficient MoE inference in serverless environments
Abstract
Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1)…
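The placement idea described in the abstract (non-expert modules on GPUs, experts on CPUs, rarely activated experts in separate serverless functions) can be illustrated with a minimal sketch. This is not Remoe's algorithm: the abstract's list of techniques is truncated, so the frequency-ranking rule, the `remote_fraction` parameter, the module names, and the function `plan_placement` below are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def plan_placement(activation_log, num_experts, remote_fraction=0.5):
    """Toy placement plan (hypothetical, not the paper's method).

    activation_log: iterable of expert ids observed in past requests.
    remote_fraction: assumed share of experts to offload to remote functions.
    """
    counts = Counter(activation_log)
    # Rank experts from least to most frequently activated; ties break by id.
    ranked = sorted(range(num_experts), key=lambda e: (counts[e], e))
    num_remote = int(num_experts * remote_fraction)
    remote = set(ranked[:num_remote])

    # Non-expert modules stay on the GPU (module names are assumptions).
    placement = {"attention": "gpu", "router": "gpu"}
    for e in range(num_experts):
        # Rarely used experts go to a separate function; the rest stay on CPU.
        placement[f"expert_{e}"] = "remote_function" if e in remote else "cpu_local"
    return placement


if __name__ == "__main__":
    log = [0, 0, 0, 1, 1, 2]  # expert 0 is hot, expert 3 is never used
    for module, target in plan_placement(log, num_experts=4).items():
        print(module, "->", target)
```

Because expert activation is input-dependent, a static ranking like this one can be wrong for any given request, which is the difficulty the abstract points to when it says simple model partitioning does not mitigate the cost.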
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Cloud Computing and Resource Management
