Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs

Youhe Jiang; Fangcheng Fu; Xiaozhe Yao; Guoliang He; Xupeng Miao; Ana Klimovic; Bin Cui; Binhang Yuan; Eiko Yoneki

arXiv:2502.00722·cs.DC·June 6, 2025

TL;DR

This paper investigates how to optimize cost-efficiency in serving large language models over diverse GPU types by benchmarking, analyzing resource demands, and designing a scheduling algorithm that outperforms existing baselines.

Contribution

It introduces a comprehensive study on heterogeneous GPU deployment for LLM serving and proposes a mixed-integer linear programming-based scheduling algorithm for cost optimization.
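To make the scheduling idea concrete, the following is a minimal sketch of a toy integer program: choose how many GPUs of each type to rent and which workload each serves, minimizing hourly cost while meeting throughput demand. It is not the paper's formulation. The GPU names, prices, throughputs, demands, and the helper `cheapest_plan` are all hypothetical, and exhaustive search stands in for a real MILP solver so the example runs with the standard library alone.

```python
from itertools import product

# Hypothetical catalogue: hourly price (USD) and per-GPU throughput (requests/s)
# for each workload type. Illustrative numbers only, not taken from the paper.
GPUS = {
    "gpu_a": {"price": 4.0, "tput": {"short": 10.0, "long": 4.0}},
    "gpu_b": {"price": 1.5, "tput": {"short": 5.0, "long": 1.0}},
}
DEMAND = {"short": 20.0, "long": 4.0}   # required requests/s per workload
AVAILABLE = {"gpu_a": 6, "gpu_b": 6}    # rentable GPUs of each type


def cheapest_plan(gpus, demand, available):
    """Exhaustively search integer assignments (GPU type, workload) -> count."""
    names = sorted(gpus)
    workloads = sorted(demand)
    slots = [(g, w) for g in names for w in workloads]
    ranges = [range(available[g] + 1) for g, _ in slots]

    best_cost, best_plan = None, None
    for counts in product(*ranges):
        plan = dict(zip(slots, counts))
        # Availability constraint: cannot rent more GPUs than exist per type.
        if any(sum(plan[(g, w)] for w in workloads) > available[g] for g in names):
            continue
        # Demand constraint: assigned GPUs must cover each workload's throughput.
        if any(
            sum(plan[(g, w)] * gpus[g]["tput"][w] for g in names) < demand[w]
            for w in workloads
        ):
            continue
        cost = sum(plan[(g, w)] * gpus[g]["price"] for g, w in slots)
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_cost, best_plan


if __name__ == "__main__":
    cost, plan = cheapest_plan(GPUS, DEMAND, AVAILABLE)
    print(cost, {k: v for k, v in plan.items() if v})
```

With these made-up numbers, the cheapest plan mixes both GPU types, whereas a plan restricted to `gpu_a` costs more and one restricted to `gpu_b` exceeds availability. Exhaustive search scales exponentially with the number of variables, which is why a realistic version would hand the same objective and constraints to a MILP solver.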

Findings

01 Cost-efficiency can be significantly improved by optimal GPU composition and workload assignment.

02 The proposed scheduling algorithm outperforms homogeneous and heterogeneous baselines.

03 The approach is effective across various workload scenarios and GPU availabilities.

Abstract

Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied by varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving, as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study of serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. In particular, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via…

Videos

Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs · SlidesLive

Taxonomy

Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Algorithms and Data Compression