Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Vikranth Srivatsa; Zijian He; Reyna Abhyankar; Dongming Li; Yiying Zhang

arXiv:2407.00023·cs.DC·October 4, 2024·1 cites

TL;DR

Preble is a novel distributed prompt scheduling platform for large language models that significantly reduces latency by optimizing KV state reuse and load balancing across multiple GPUs.

Contribution

It introduces the first distributed prompt scheduling system that co-optimizes KV reuse and computation load balancing for scalable LLM serving.

Findings

01

Preble achieves 1.5X to 14.5X latency reduction over state-of-the-art systems.

02

Preble reduces p99 latency by 2X to 10X.

03

Evaluation on real workloads demonstrates substantial performance improvements.

Abstract

Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices are to include domain-specific instructions, illustration of tool usages, and/or long context such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests. Recent works propose to cache and reuse KV state of prompts. However, they are all confined to a single-GPU optimization, while production LLM serving systems are distributed by nature. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble with real workloads and request arrival patterns on two…
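To make the trade-off between KV state reuse and load balancing concrete, the following is a minimal sketch and not Preble's actual scheduling algorithm. It assumes a hypothetical cost model in which each GPU's cost is its queued prefill work plus the number of prompt tokens it cannot serve from cache; the names `Gpu`, `pending_tokens`, `shared_prefix_len`, and `schedule` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Gpu:
    name: str
    # Token lists whose KV state is assumed to be cached on this GPU.
    cached_prompts: list = field(default_factory=list)
    # Hypothetical load metric: tokens still waiting to be prefilled.
    pending_tokens: int = 0

    def reusable_tokens(self, prompt):
        """Longest prefix of `prompt` already covered by a cached prompt."""
        return max(
            (shared_prefix_len(prompt, c) for c in self.cached_prompts),
            default=0,
        )


def schedule(prompt, gpus):
    """Pick the GPU with the lowest estimated cost.

    Assumed cost = queued work + prompt tokens that must be recomputed.
    Ties go to the GPU listed first.
    """
    def cost(gpu):
        return gpu.pending_tokens + len(prompt) - gpu.reusable_tokens(prompt)

    best = min(gpus, key=cost)
    best.pending_tokens += len(prompt) - best.reusable_tokens(prompt)
    best.cached_prompts.append(list(prompt))
    return best


if __name__ == "__main__":
    gpus = [Gpu("a"), Gpu("b")]
    system_prompt = list(range(100))  # shared instructions, as token ids
    for suffix in ([1000, 1001], [2000], [3000]):
        chosen = schedule(system_prompt + suffix, gpus)
        print(chosen.name, [g.pending_tokens for g in gpus])
```

Under this toy cost model, a request that shares a long prefix with a busy GPU may still be sent to an idle GPU, since recomputing the prefix there is estimated to be cheaper than waiting. A real system would also need to account for decoding load, cache eviction, and request arrival patterns, which this sketch ignores.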

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 3

Strengths

The scheduling algorithm seems interesting. The authors provide extensive background on LLMs in the appendix.

Weaknesses

I think the major problem is that the authors mix the concept of high-level cluster scheduling with low-level GPU kernel design, which is very confusing. In Section 3.1, the overall design, the term “data parallel” is used inaccurately. Typically, data parallelism refers to the process of dividing data into chunks and distributing these chunks across blocks and threads in low-level kernel design, instead of splitting data to different servers at a high level. The statement in the abstract, “pro

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper achieves performance improvements compared to existing prompt caching mechanisms (vLLM and SGLang). It also incorporates fairness into the scheduling process when optimizing for the prompt cache.

Weaknesses

The major concern is that the contribution and novelty of this paper are limited. Distributed prompt sharing optimization across GPUs: optimizing inference across multiple GPUs (even in cluster-level systems) has been studied in various prior work. As some of this prior work is cited by the paper, MemServe [1], Inference without Interference [2], and Mooncake [3] are all large-scale systems that serve beyond a single GPU. Those systems all considered shared prompts, and can be used with long context

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The authors propose the first system to address efficient prompt sharing in distributed LLM environments.
- The proposed E2 scheduling algorithm seems to effectively balance prefix cache sharing and computation load across GPUs, and the authors show significant performance gains compared to SGLang.

Weaknesses

- Although the paper claims strong scalability, this is only partially supported by experiments on two four-GPU machines.

Code & Models

Repositories

wuklab/preble
Official

Videos

Preble: Efficient Distributed Prompt Scheduling for LLM Serving · SlidesLive

Taxonomy

Topics: Distributed and Parallel Computing Systems · Petri Nets in System Modeling · Distributed systems and fault tolerance