ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments

Youhe Jiang; Fangcheng Fu; Xiaozhe Yao; Taiyi Wang; Bin Cui; Ana Klimovic; Eiko Yoneki

arXiv:2502.09334 · cs.DC · November 7, 2025

Open Access

TL;DR

ThunderServe is an LLM serving system for heterogeneous cloud environments. It optimizes the deployment plan for mixed GPU types and network conditions and adapts to changing online conditions, delivering up to 2.1× higher throughput and up to 2.5× lower latency.

Contribution

The paper introduces a scheduling algorithm and a lightweight re-scheduling mechanism, both tailored to heterogeneous cloud resources, to improve serving performance and cost-efficiency.

Findings

01. Up to 2.1× increase in throughput
02. Up to 2.5× reduction in latency
03. Better cost-efficiency compared to state-of-the-art systems

Abstract

Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and various GPU types on the cloud bring difficulties to achieving high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g.,…

Taxonomy

Topics: IoT and Edge/Fog Computing · Cloud Computing and Resource Management · Software System Performance and Reliability