Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki

TL;DR
This paper investigates how to optimize cost-efficiency in serving large language models over diverse GPU types by benchmarking, analyzing resource demands, and designing a scheduling algorithm that outperforms existing baselines.
Contribution
It presents a comprehensive study of heterogeneous GPU deployment for LLM serving and proposes a scheduling algorithm, based on mixed-integer linear programming, that optimizes serving cost.
Findings
Cost-efficiency can be significantly improved by optimal GPU composition and workload assignment.
The proposed scheduling algorithm outperforms homogeneous and heterogeneous baselines.
The approach is effective across various workload scenarios and GPU availabilities.
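To make the idea of jointly choosing GPU composition and workload assignment concrete, here is a minimal sketch. It is not the paper's algorithm: brute-force enumeration stands in for a mixed-integer linear programming solver, and every GPU name, price, throughput figure, and demand value is a hypothetical number chosen only for illustration.

```python
from itertools import product

# All values are hypothetical and only illustrate the shape of the problem.
PRICE = {"fast": 4, "cheap": 1}        # cost per GPU per hour
AVAILABLE = {"fast": 4, "cheap": 8}    # GPUs that can be rented per type
DEMAND = {"compute_heavy": 20, "memory_heavy": 12}  # requests/s to serve
THROUGHPUT = {                         # requests/s one GPU sustains per workload
    ("fast", "compute_heavy"): 10,
    ("fast", "memory_heavy"): 6,
    ("cheap", "compute_heavy"): 2,
    ("cheap", "memory_heavy"): 3,
}


def cheapest_plan(allowed_gpus):
    """Return (cost, plan) for the cheapest feasible plan, or None.

    A plan maps (gpu_type, workload) to an integer GPU count. These are
    the integer decision variables a MILP solver would search over; here
    they are enumerated exhaustively because the instance is tiny.
    """
    slots = [(g, w) for g in allowed_gpus for w in DEMAND]
    ranges = [range(AVAILABLE[g] + 1) for g, _ in slots]
    best = None
    for counts in product(*ranges):
        plan = dict(zip(slots, counts))
        # Availability constraint: GPUs used per type cannot exceed supply.
        if any(
            sum(n for (g2, _), n in plan.items() if g2 == g) > AVAILABLE[g]
            for g in allowed_gpus
        ):
            continue
        # Demand constraint: each workload must receive enough throughput.
        if any(
            sum(n * THROUGHPUT[s] for s, n in plan.items() if s[1] == w) < DEMAND[w]
            for w in DEMAND
        ):
            continue
        cost = sum(n * PRICE[g] for (g, _), n in plan.items())
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best


mixed = cheapest_plan(["fast", "cheap"])
fast_only = cheapest_plan(["fast"])
cheap_only = cheapest_plan(["cheap"])
print(mixed, fast_only, cheap_only)
```

With these made-up numbers, the mixed plan sends the compute-heavy workload to the fast GPUs and the memory-heavy workload to the cheap ones, and costs less than the fast-only plan, while the cheap-only pool cannot meet demand at all. A real deployment would replace the enumeration with a MILP solver and the toy throughput table with benchmarked values.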
Abstract
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study of serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Algorithms and Data Compression
