Fast Distributed Inference Serving for Large Language Models

Bingyang Wu; Yinmin Zhong; Zili Zhang; Shengyu Liu; Fangyue Liu; Yuanhang Sun; Gang Huang; Xuanzhe Liu; Xin Jin

arXiv:2305.05920 · cs.LG · September 26, 2024 · 22 cites

Open Access

TL;DR

FastServe is a distributed inference serving system for LLMs that reduces latency and raises throughput through token-level preemptive scheduling and efficient GPU memory management, with reported throughput gains of up to 31.4x over vLLM.

Contribution

FastServe introduces a skip-join Multi-Level Feedback Queue scheduler that preempts jobs at the granularity of individual output tokens, together with GPU memory management tailored to LLM inference, enabling low-latency, high-throughput serving.

Findings

01. Up to 31.4x throughput improvement over vLLM.

02. Reduces inference latency and tail latency effectively.

03. Efficient GPU memory management enhances performance.

Abstract

Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped…
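The skip-join idea in the abstract can be illustrated with a minimal Python sketch. The abstract is truncated here, so everything beyond "use the input length to pick the initial queue, skip the higher-priority queues, and allow preemption after every output token" is an assumption made for illustration: the quanta in `QUANTA`, the linear `prefill_time` cost model, the constant `DECODE_TIME`, and the demotion rule are hypothetical and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical per-level quanta in arbitrary time units (level 0 = highest priority).
QUANTA = [1, 2, 4, 8]
# Hypothetical cost of generating one token after the first.
DECODE_TIME = 0.25


def prefill_time(input_len: int) -> float:
    """Assumed linear cost model for the first iteration (prompt processing)."""
    return input_len / 256


def initial_queue(input_len: int) -> int:
    """Skip-join: enter the highest-priority queue whose quantum covers the
    estimated first-iteration time; all queues above it are skipped."""
    t = prefill_time(input_len)
    for level, quantum in enumerate(QUANTA):
        if t <= quantum:
            return level
    return len(QUANTA) - 1


@dataclass
class Job:
    name: str
    input_len: int       # known at arrival
    output_len: int      # unknown to the scheduler; only ends the simulation
    generated: int = 0   # output tokens produced so far
    used: float = 0.0    # time consumed in the current queue


class SkipJoinMLFQ:
    def __init__(self) -> None:
        self.queues = [deque() for _ in QUANTA]

    def submit(self, job: Job) -> None:
        self.queues[initial_queue(job.input_len)].append(job)

    def step(self):
        """Run one token iteration of the highest-priority job.
        Returns the job name if it finished, otherwise None."""
        for level, queue in enumerate(self.queues):
            if queue:
                job = queue.popleft()
                break
        else:
            return None
        cost = prefill_time(job.input_len) if job.generated == 0 else DECODE_TIME
        job.generated += 1
        job.used += cost
        if job.generated >= job.output_len:
            return job.name
        if job.used >= QUANTA[level] and level < len(QUANTA) - 1:
            job.used = 0.0
            self.queues[level + 1].append(job)   # quantum exhausted: demote
        else:
            self.queues[level].appendleft(job)   # stays runnable at its level
        return None


def run(scheduler: SkipJoinMLFQ) -> list:
    """Step until all queues are empty; return job names in completion order."""
    finished = []
    while any(scheduler.queues):
        name = scheduler.step()
        if name is not None:
            finished.append(name)
    return finished
```

Because `step` returns control after every generated token, a short job that arrives while a long job is running is picked up at the next token boundary instead of waiting for the long job to finish, which is the head-of-line blocking that run-to-completion processing suffers from.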

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Advanced Graph Neural Networks