Fast Distributed Inference Serving for Large Language Models
Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin

TL;DR
FastServe is a distributed LLM inference serving system that uses token-level preemptive scheduling and GPU memory management to reduce latency and increase throughput compared with run-to-completion serving systems.
Contribution
FastServe introduces a novel preemptive scheduling approach and GPU memory management tailored for LLM inference, enabling low-latency, high-throughput serving.
Findings
Up to 31.4x higher throughput than vLLM under the same average latency requirement.
Reduces inference latency and tail latency effectively.
Efficient GPU memory management enhances performance.
Abstract
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped…
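The skip-join idea described in the abstract can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions, not FastServe's implementation: the quantum values in `QUANTA`, the linear `first_token_cost` model, the fixed `DECODE_COST`, and all class and function names are hypothetical. It shows two things from the abstract: a job joins the first queue whose quantum covers its estimated first-token cost (skipping the higher-priority queues), and the scheduler can switch jobs after every output token.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical per-level time quanta, highest priority first.
QUANTA = [1.0, 2.0, 4.0, 8.0]
# Hypothetical cost of generating one token after the first.
DECODE_COST = 0.25


def first_token_cost(input_len: int) -> float:
    """Assumed linear cost model: longer prompts take longer to produce the first token."""
    return input_len / 100.0


def initial_level(input_len: int) -> int:
    """Skip-join: pick the first queue whose quantum covers the first-token cost."""
    need = first_token_cost(input_len)
    for level, quantum in enumerate(QUANTA):
        if quantum >= need:
            return level
    return len(QUANTA) - 1  # fall back to the lowest-priority queue


@dataclass
class Job:
    name: str
    input_len: int      # known on arrival
    output_len: int     # unknown to a real scheduler; used here only to end the simulation
    generated: int = 0
    level: int = 0
    used: float = 0.0   # time consumed at the current level


class SkipJoinMLFQ:
    def __init__(self) -> None:
        self.queues = [deque() for _ in QUANTA]

    def admit(self, job: Job) -> None:
        job.level = initial_level(job.input_len)
        self.queues[job.level].append(job)

    def run_one_token(self):
        """Run the highest-priority job for one token; return its name, or None if idle."""
        for queue in self.queues:
            if queue:
                job = queue.popleft()
                break
        else:
            return None
        cost = first_token_cost(job.input_len) if job.generated == 0 else DECODE_COST
        job.generated += 1
        job.used += cost
        if job.generated >= job.output_len:
            return job.name  # finished; do not requeue
        # Demote once the quantum at the current level is used up.
        if job.used >= QUANTA[job.level] and job.level < len(QUANTA) - 1:
            job.level += 1
            job.used = 0.0
        self.queues[job.level].append(job)
        return job.name


sched = SkipJoinMLFQ()
sched.admit(Job("long", input_len=350, output_len=3))
order = [sched.run_one_token()]            # only "long" is present
sched.admit(Job("short", input_len=50, output_len=2))
while (name := sched.run_one_token()) is not None:
    order.append(name)
print(order)
```

In this sketch the short job arrives after the long job has produced its first token, yet it runs next because it sits in a higher-priority queue and the scheduler re-decides after each token. Under run-to-completion processing the short job would wait for the long one to finish.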
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Advanced Graph Neural Networks
