HAS-GPU: Efficient Hybrid Auto-scaling with Fine-grained GPU Allocation for SLO-aware Serverless Inferences

Jianfeng Gu; Puxuan Wang; Isaac David Nunez Araya; Kai Huang; Michael Gerndt

arXiv:2505.01968·cs.DC·September 3, 2025

TL;DR

HAS-GPU introduces a hybrid auto-scaling architecture with fine-grained GPU resource allocation and an adaptive scheduler, significantly reducing costs and SLO violations in serverless deep learning inferences.

Contribution

The paper proposes a novel hybrid auto-scaling framework that combines fine-grained GPU management with performance prediction to improve efficiency and SLO adherence in serverless inference platforms.

Findings

01 Reduces function costs by 10.8x on average.

02 Decreases SLO violations by 4.8x.

03 Achieves 1.72x cost reduction compared to state-of-the-art frameworks.

Abstract

Serverless Computing (FaaS) has become a popular paradigm for deep learning inference due to its ease of deployment and pay-per-use benefits. However, current serverless inference platforms allocate GPU resources in a coarse-grained, static manner during scaling, which leads to high costs and Service Level Objective (SLO) violations under fluctuating workloads. Moreover, current platforms support only horizontal scaling for GPU inference, so cold starts further exacerbate these problems. In this paper, we propose HAS-GPU, an efficient Hybrid Auto-scaling Serverless architecture with fine-grained GPU allocation for deep learning inference. HAS-GPU provides an agile scheduler capable of allocating GPU Streaming Multiprocessor (SM) partitions and time quotas with arbitrary granularity, and enables significant vertical quota scalability at runtime. To resolve…
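
The hybrid scaling idea in the abstract can be illustrated with a minimal sketch. It is not the HAS-GPU scheduler: the `Replica` type, the `capacity` and `scale` functions, the `peak_rps` parameter, and the linear throughput model are all hypothetical and chosen only for illustration. The sketch raises the time quota of running replicas first (vertical scaling, no cold start) and adds a replica only when the quota ceiling is reached (horizontal scaling).

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Replica:
    sm_fraction: float  # hypothetical: share of the GPU's SMs, in (0, 1]
    time_quota: float   # hypothetical: share of GPU time, in (0, 1]


def capacity(r: Replica, peak_rps: float) -> float:
    # Assumed toy model: throughput grows linearly with both shares.
    return peak_rps * r.sm_fraction * r.time_quota


def scale(replicas: List[Replica], demand_rps: float, peak_rps: float,
          max_quota: float = 1.0, eps: float = 1e-9) -> List[Replica]:
    """Return a new replica list sized for demand_rps (scale-up only)."""
    out = [replace(r) for r in replicas]  # copy so the input is not mutated

    # Vertical step: raise time quotas of running replicas up to max_quota.
    for r in out:
        shortfall = demand_rps - sum(capacity(x, peak_rps) for x in out)
        if shortfall <= eps:
            return out
        per_quota = peak_rps * r.sm_fraction  # throughput per unit of quota
        r.time_quota = min(max_quota, r.time_quota + shortfall / per_quota)

    # Horizontal step: add replicas only for the demand that is still unmet.
    template_sm = out[0].sm_fraction if out else 1.0
    per_quota = peak_rps * template_sm
    shortfall = demand_rps - sum(capacity(x, peak_rps) for x in out)
    while shortfall > eps:
        quota = min(max_quota, shortfall / per_quota)
        out.append(Replica(template_sm, quota))
        shortfall -= per_quota * quota
    return out
```

Under these assumptions, `scale([Replica(0.5, 0.4)], 40, 100)` covers the load by raising the single replica's time quota, while a demand of 80 saturates that quota and adds a second replica. Scale-down and the performance prediction mentioned in the contribution summary are left out of the sketch.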

Code & Models

Repositories

KontonGu/HAS-GPU (PyTorch, official)