ENOVA: Autoscaling towards Cost-effective and Stable Serverless LLM Serving

Tao Huang; Pengfei Chen; Kyoka Gong; Jocky Hawk; Zachary Bright; Wenxin Xie; Kecheng Huang; Zhi Ji

arXiv:2407.09486·cs.DC·July 16, 2024·1 cites

Open Access

TL;DR

ENOVA is a system for cost-effective, stable, autoscaled serverless LLM deployment on multi-GPU clusters, combining automatic configuration, performance monitoring, and scheduling.

Contribution

It introduces a novel deployment, monitoring, and autoscaling framework specifically designed for serverless LLM serving on multi-GPU clusters, addressing low utilization and service quality issues.

Findings

01

ENOVA significantly outperforms existing methods in experiments.

02

It achieves high GPU utilization and stable LLM service.

03

ENOVA is suitable for deployment in large-scale online systems.

Abstract

With the growing popularity of large language model (LLM) backend systems, it has become common and necessary to deploy stable serverless LLM serving on multi-GPU clusters with autoscaling. However, this is challenging because the diversity and co-location of applications in multi-GPU clusters lead to low service quality and low GPU utilization. To address these challenges, we build ENOVA, a deployment, monitoring, and autoscaling service for serverless LLM serving. ENOVA comprehensively deconstructs the execution process of an LLM service and, on that basis, designs a configuration recommendation module for automatic deployment on any GPU cluster and a performance detection module for autoscaling. On top of these modules, ENOVA implements a deployment execution engine for multi-GPU cluster scheduling. The experiment results show that ENOVA significantly outperforms other state-of-the-art methods and…

Taxonomy

Topics: Smart Grid Security and Resilience · Blockchain Technology Applications and Security · Cloud Computing and Resource Management
