AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving

Zhuohan Li; Lianmin Zheng; Yinmin Zhong; Vincent Liu; Ying Sheng; Xin Jin; Yanping Huang; Zhifeng Chen; Hao Zhang; Joseph E. Gonzalez; Ion Stoica

arXiv:2302.11665 · cs.LG · July 20, 2023 · 20 cites

TL;DR

AlpaServe uses model parallelism to statistically multiplex devices across multiple models in deep learning serving. By choosing an efficient placement and parallelization of models across a distributed cluster, it sustains higher request rates and burstier workloads within latency constraints.

Contribution

This work introduces AlpaServe, a serving system that uses model parallelism to multiplex devices across multiple models, trading the overhead of model parallelism against lower serving latency under bursty workloads.

Findings

01. AlpaServe processes requests at up to 10x higher rates.

02. It handles 6x more burstiness while staying within latency constraints.

03. It determines efficient placement and parallelization strategies for collections of models across a distributed cluster.

Abstract

Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10x higher…
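The trade-off described in the abstract can be illustrated with a toy queueing sketch. It is not AlpaServe's placement algorithm. It only compares two hand-picked placements under stated assumptions: each model fits on one GPU, requests are served first-come-first-served, and pipelining a model over `num_stages` GPUs inflates its total compute time by a fixed fraction `overhead`. The function names, parameters, and numbers are all hypothetical.

```python
from typing import List, Tuple

Request = Tuple[float, int]  # (arrival_time, model_id)


def dedicated_latencies(requests: List[Request], service_time: float,
                        num_models: int) -> List[float]:
    """One GPU per model: a burst for one model queues on that GPU only."""
    free_at = [0.0] * num_models  # time at which each model's GPU becomes idle
    latencies = []
    for arrival, model in sorted(requests):
        start = max(arrival, free_at[model])
        free_at[model] = start + service_time
        latencies.append(free_at[model] - arrival)
    return latencies


def pipelined_latencies(requests: List[Request], service_time: float,
                        num_stages: int, overhead: float) -> List[float]:
    """Every model is split into pipeline stages spread over all GPUs.

    Assumption: each stage takes an equal share of the inflated compute time,
    and all models share the same pipeline in arrival order.
    """
    stage_time = service_time * (1.0 + overhead) / num_stages
    stage_free_at = [0.0] * num_stages  # time at which each stage becomes idle
    latencies = []
    for arrival, _model in sorted(requests):
        t = arrival
        for s in range(num_stages):
            start = max(t, stage_free_at[s])
            t = start + stage_time
            stage_free_at[s] = t
        latencies.append(t - arrival)
    return latencies


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical burst: four requests for model 0 arrive at once, model 1 is idle.
    burst = [(0.0, 0)] * 4
    print("dedicated mean latency:", mean(dedicated_latencies(burst, 1.0, 2)))
    print("pipelined mean latency:", mean(pipelined_latencies(burst, 1.0, 2, 0.1)))
```

In this toy setting, the burst sees a mean latency of about 2.5 time units with dedicated GPUs and about 1.93 with the shared two-stage pipeline, because the idle model's GPU absorbs part of the burst. A single isolated request, by contrast, takes 1.0 on a dedicated GPU and about 1.1 in the pipeline, which is the overhead side of the trade-off.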

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Age of Information Optimization