HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds
Chiheng Lou, Sheng Qi, Chao Jin, Dapeng Nie, Haoran Yang, Yu Ding, Xuanzhe Liu, Xin Jin

TL;DR
HydraServe is a serverless LLM serving system that reduces cold start latency in public clouds by proactively distributing models across servers, overlapping cold-start stages within workers, placing workers across GPUs to avoid network contention, and consolidating pipelines to limit resource consumption.
Contribution
HydraServe introduces techniques for proactive model distribution, overlapping of cold-start stages, contention-aware worker placement across GPUs, and pipeline consolidation to minimize cold start latency in serverless LLM serving.
Findings
Reduces cold start latency by 1.7× to 4.7×
Improves service level objective attainment by 1.43× to 1.74×
Demonstrates effectiveness across diverse settings
Abstract
With the proliferation of large language model (LLM) variants, developers are turning to serverless computing for cost-efficient LLM deployment. However, public cloud providers often struggle to provide performance guarantees for serverless LLM serving due to significant cold start latency caused by substantial model sizes and complex runtime dependencies. To address this problem, we present HydraServe, a serverless LLM serving system designed to minimize cold start latency in public clouds. HydraServe proactively distributes models across servers to quickly fetch them, and overlaps cold-start stages within workers to reduce startup latency. Additionally, HydraServe strategically places workers across GPUs to avoid network contention among cold-start instances. To minimize resource consumption during cold starts, HydraServe further introduces pipeline consolidation that can merge groups…
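To make the idea of overlapping cold-start stages concrete, the following is a minimal sketch, not HydraServe's implementation. It assumes a simplified cold start with three hypothetical stages (`fetch_model`, `init_runtime`, `load_to_gpu`), assumes the first two are independent and can run concurrently, and uses placeholder durations that are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartStages:
    """Hypothetical per-stage durations in seconds (illustrative only)."""
    fetch_model: float   # download model weights to the server
    init_runtime: float  # start the container and initialize libraries
    load_to_gpu: float   # copy weights into GPU memory


def sequential_latency(s: ColdStartStages) -> float:
    # Baseline: every stage waits for the previous one to finish.
    return s.fetch_model + s.init_runtime + s.load_to_gpu


def overlapped_latency(s: ColdStartStages) -> float:
    # Assumption: fetching and runtime initialization do not depend on
    # each other, so they run in parallel; GPU loading needs both done.
    return max(s.fetch_model, s.init_runtime) + s.load_to_gpu


# Placeholder numbers chosen for illustration, not taken from the paper.
stages = ColdStartStages(fetch_model=8.0, init_runtime=5.0, load_to_gpu=3.0)
speedup = sequential_latency(stages) / overlapped_latency(stages)
```

Under these assumed durations the overlapped cold start is bounded by the slower of the two concurrent stages rather than their sum, which is why shortening model fetching (for example through proactive distribution) and avoiding network contention matter most once stages overlap.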
Taxonomy
Topics: IoT and Edge/Fog Computing · Caching and Content Delivery · Software-Defined Networks and 5G
