λScale: Enabling Fast Scaling for Serverless Large Language Model Inference

Minchen Yu; Rui Yang; Chaobo Jia; Zhaoyuan Su; Sheng Yao; Tingfeng Lan; Yuchen Yang; Zirui Wang; Yue Cheng; Wei Wang; Ao Wang; Ruichuan Chen

arXiv:2502.09922·cs.DC·March 9, 2026

TL;DR

λScale is a serverless inference system that uses RDMA networks and distributed execution to scale large language models rapidly, reducing latency and cost during workload spikes.

Contribution

The paper presents λScale, a system that combines high-speed network multicast with distributed inference to improve scaling efficiency for large models in serverless environments.

Findings

1. Achieves up to 5x tail-latency reduction
2. Reduces inference costs by 31.3%
3. Effectively handles bursty workloads

Abstract

Serverless computing has emerged as a compelling solution for cloud-based model inference. However, as modern large language models (LLMs) continue to grow in size, existing serverless platforms often face substantial model startup overhead. This poses a significant challenge in efficiently scaling model instances to accommodate dynamic, bursty workloads commonly observed in real-world inference services. In this paper, we introduce λScale, an efficient serverless inference system to achieve fast model scaling. The key idea behind λScale is to leverage high-speed RDMA networks between GPU nodes for fast model multicast, while enabling distributed inference execution during model transmission -- referred to as "execute-while-load". λScale proposes an efficient model scaling scheme, λPipe, which supports adaptive model multicast and dynamically constructs…
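
The "execute-while-load" idea in the abstract can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the paper's algorithm: the function name, the block-based model split, the even partitioning across nodes, and the independent per-node links are all hypothetical simplifications introduced here.

```python
import math


def time_to_first_inference(num_blocks, num_nodes, block_transfer_s,
                            execute_while_load):
    """Toy estimate (seconds) of when new nodes can first serve a request.

    Assumptions (hypothetical, for illustration only):
    - the model is split into `num_blocks` equally sized blocks;
    - every node receives one block per `block_transfer_s` seconds over
      its own link, independently of the other nodes;
    - with execute-while-load, each node first receives a distinct slice
      of the blocks, and the nodes jointly serve requests as one
      distributed pipeline as soon as all blocks exist somewhere.
    """
    if execute_while_load:
        # Each node only needs its own slice before the group can serve.
        blocks_needed = math.ceil(num_blocks / num_nodes)
    else:
        # A node must hold the full model before it can serve alone.
        blocks_needed = num_blocks
    return blocks_needed * block_transfer_s


if __name__ == "__main__":
    full = time_to_first_inference(32, 4, 0.5, execute_while_load=False)
    piped = time_to_first_inference(32, 4, 0.5, execute_while_load=True)
    print(full, piped)  # 16.0 4.0
```

Under these assumptions, four nodes loading a 32-block model could start serving collectively after 4 seconds instead of 16. The real system's gains depend on network topology, multicast scheduling, and pipeline overheads that this sketch ignores.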

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Topic Modeling