Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing

Wentao Liu; Yuhao Hu; Ruiting Zhou; Baochun Li; Ne Wang

arXiv:2512.18674·cs.DC·December 23, 2025

Open Access

TL;DR

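Remoe is a system that makes Mixture-of-Experts (MoE) LLM inference more efficient and cost-effective in serverless environments by running non-expert modules on GPUs and experts on CPUs, offloading infrequently activated experts to separate serverless functions, and optimizing memory use and parallel execution.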

Contribution

Remoe introduces a heterogeneous MoE inference system with novel algorithms for expert activation prediction, memory management, and parallelization tailored for serverless computing.

Findings

01  Reduces inference cost by up to 57%

02  Cuts cold start latency by 47%

03  Achieves efficient MoE inference in serverless environments

Abstract

Mixture-of-Experts (MoE) has become a dominant architecture in large language models (LLMs) due to its ability to scale model capacity via sparse expert activation. Meanwhile, serverless computing, with its elasticity and pay-per-use billing, is well-suited for deploying MoEs with bursty workloads. However, the large number of experts in MoE models incurs high inference costs due to memory-intensive parameter caching. These costs are difficult to mitigate via simple model partitioning due to input-dependent expert activation. To address these issues, we propose Remoe, a heterogeneous MoE inference system tailored for serverless computing. Remoe assigns non-expert modules to GPUs and expert modules to CPUs, and further offloads infrequently activated experts to separate serverless functions to reduce memory overhead and enable parallel execution. We incorporate three key techniques: (1)…
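
The placement idea described in the abstract (non-expert modules on GPUs, experts on CPUs, rarely used experts moved to separate serverless functions) can be illustrated with a minimal sketch. This is not the paper's algorithm: the truncated abstract does not give its techniques, so the function name `place_modules`, the `Placement` container, the module names, and the fixed frequency threshold below are all hypothetical, chosen only to show the three-way split.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Placement:
    """Hypothetical container for where each module runs."""
    gpu: List[str] = field(default_factory=list)     # non-expert modules
    cpu: List[str] = field(default_factory=list)     # frequently activated experts
    remote: List[str] = field(default_factory=list)  # experts in separate serverless functions


def place_modules(
    non_expert_modules: List[str],
    expert_activation_freq: Dict[str, float],
    offload_threshold: float = 0.05,  # assumed cutoff, not taken from the paper
) -> Placement:
    """Split modules into GPU, CPU, and remote-function groups.

    Experts whose observed activation frequency falls below the threshold
    are treated as "infrequently activated" and offloaded.
    """
    placement = Placement(gpu=list(non_expert_modules))
    # Sort by name so the result is deterministic.
    for expert, freq in sorted(expert_activation_freq.items()):
        if freq < offload_threshold:
            placement.remote.append(expert)
        else:
            placement.cpu.append(expert)
    return placement


if __name__ == "__main__":
    # Made-up activation frequencies for four experts.
    freqs = {"expert_0": 0.40, "expert_1": 0.02, "expert_2": 0.30, "expert_3": 0.01}
    result = place_modules(["attention", "router"], freqs)
    print(result)
```

A fixed threshold is the simplest possible policy. Because expert activation is input-dependent, as the abstract notes, a real system would need to estimate these frequencies per workload rather than hard-code them.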

Taxonomy

Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Cloud Computing and Resource Management