TL;DR
HAS-GPU introduces a hybrid auto-scaling architecture with fine-grained GPU resource allocation and an adaptive scheduler, significantly reducing costs and SLO violations for serverless deep learning inference.
Contribution
The paper proposes a hybrid auto-scaling framework that combines fine-grained GPU management with performance prediction to improve efficiency and SLO adherence on serverless inference platforms.
Findings
Reduces function costs by 10.8x on average.
Decreases SLO violations by 4.8x.
Achieves 1.72x cost reduction compared to state-of-the-art frameworks.
Abstract
Serverless computing (Function-as-a-Service, FaaS) has become a popular paradigm for deep learning inference due to its ease of deployment and pay-per-use benefits. However, current serverless inference platforms suffer from coarse-grained, static GPU resource allocation during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Moreover, current platforms support only horizontal scaling for GPU inference, so the cold start problem further exacerbates these issues. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU introduces an agile scheduler capable of allocating GPU Streaming Multiprocessor (SM) partitions and time quotas with arbitrary granularity, and it enables significant vertical quota scalability at runtime. To resolve…
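The hybrid scaling idea in the abstract (grow an existing instance's time quota first, and add instances only when that is not enough) can be illustrated with a minimal sketch. This is not the HAS-GPU scheduler: the `Instance` type, the `plan_scaling` function, the linear throughput model, and every parameter name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    sm_fraction: float  # share of the GPU's SMs held by this instance (hypothetical unit, 0..1)
    time_quota: float   # share of GPU time slices granted to this instance (0..1)


def plan_scaling(instances, demand_rps, peak_rps, new_sm_fraction):
    """Toy hybrid scaling plan: vertical first, then horizontal.

    Assumption (not from the paper): throughput is linear in both shares,
    i.e. capacity = peak_rps * sm_fraction * time_quota.
    """
    if peak_rps <= 0 or new_sm_fraction <= 0:
        raise ValueError("peak_rps and new_sm_fraction must be positive")

    remaining = float(demand_rps)
    resized = []
    # Vertical step: adjust the time quota of each existing instance,
    # up to a full quota of 1.0, without starting anything new.
    for inst in instances:
        max_capacity = peak_rps * inst.sm_fraction
        served = min(remaining, max_capacity)
        resized.append(Instance(inst.sm_fraction, served / max_capacity))
        remaining -= served

    # Horizontal step: only the demand that vertical scaling could not
    # absorb triggers new instances (and therefore cold starts).
    added = []
    new_capacity = peak_rps * new_sm_fraction
    while remaining > 1e-9:
        served = min(remaining, new_capacity)
        added.append(Instance(new_sm_fraction, served / new_capacity))
        remaining -= served
    return resized, added


# One instance on a quarter of the SMs, full-GPU peak of 400 requests/s.
resized, added = plan_scaling([Instance(0.25, 0.5)], 150, 400, 0.25)
print(resized, added)
```

In this toy model, a demand that fits within the existing instance's SM partition is handled purely by raising its time quota, so no new instance is started; a new instance appears only for the overflow.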
