AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica

TL;DR
AlpaServe uses model parallelism to statistically multiplex devices across multiple models in deep learning serving. By choosing how to place and parallelize models across a distributed cluster, it sustains higher request rates and burstier workloads while staying within latency constraints.
Contribution
This work introduces AlpaServe, a serving system that applies model parallelism to multiplex devices across multiple models, trading the overhead of model parallelism against the latency reduction that statistical multiplexing offers under bursty workloads.
Findings
AlpaServe processes requests at up to 10x higher rates.
It handles 6x more burstiness while staying within latency constraints.
It determines an efficient placement and parallelization strategy for collections of large models across a distributed cluster.
Abstract
Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10x higher…
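The trade-off described in the abstract can be illustrated with a toy queueing sketch. This is not AlpaServe's placement algorithm, only a minimal model under stated assumptions: two models and two GPUs, a hypothetical service time of 1.0 per request on one GPU, and a hypothetical 10% overhead when a model is split into two pipeline stages. The function names and parameters are invented for illustration.

```python
def dedicated_latencies(arrivals, service_time=1.0):
    """FIFO queue on one GPU that holds the whole model (no parallelism)."""
    free_at = 0.0
    latencies = []
    for t in sorted(arrivals):
        start = max(t, free_at)           # wait until the GPU is free
        free_at = start + service_time
        latencies.append(free_at - t)
    return latencies


def pipelined_latencies(arrivals, service_time=1.0, overhead=0.1, stages=2):
    """FIFO queue through a pipeline with one stage per GPU.

    Assumption: splitting the model adds `overhead` (a fraction) to the
    total compute, divided evenly across the stages.
    """
    stage_time = service_time * (1.0 + overhead) / stages
    free_at = [0.0] * stages              # when each GPU (stage) is next free
    latencies = []
    for t in sorted(arrivals):
        ready = t
        for s in range(stages):
            start = max(ready, free_at[s])
            free_at[s] = start + stage_time
            ready = free_at[s]            # output of stage s feeds stage s + 1
        latencies.append(ready - t)
    return latencies


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    burst = [0.0, 0.0, 0.0, 0.0]          # four requests for model A at once
    single = [0.0]                        # one isolated request
    print("burst, dedicated GPU :", f"{mean(dedicated_latencies(burst)):.3f}")
    print("burst, 2-stage pipe  :", f"{mean(pipelined_latencies(burst)):.3f}")
    print("single, dedicated GPU:", f"{mean(dedicated_latencies(single)):.3f}")
    print("single, 2-stage pipe :", f"{mean(pipelined_latencies(single)):.3f}")
```

In this sketch an isolated request is slower on the pipeline because of the assumed overhead, but a burst aimed at one model finishes sooner on average because it can use the GPU that would otherwise sit idle waiting for the other model's requests. Each GPU holds half of both models in the pipelined case, so the memory footprint per GPU is the same as in the dedicated case.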
Taxonomy
Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Age of Information Optimization
