DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference

Connor Holmes; Masahiro Tanaka; Michael Wyatt; Ammar Ahmad Awan; Jeff Rasley; Samyam Rajbhandari; Reza Yazdani Aminabadi; Heyang Qin; Arash Bakhtiari; Lev Kurilenko; Yuxiong He

arXiv:2401.08671·cs.PF·January 18, 2024·6 cites

TL;DR

DeepSpeed-FastGen is an LLM serving system that raises throughput and lowers latency relative to systems such as vLLM by using Dynamic SplitFuse, a prompt and generation composition strategy.

Contribution

It introduces Dynamic SplitFuse, a new prompt and generation composition strategy, combined with DeepSpeed-MII and DeepSpeed-Inference, to enhance LLM serving performance.

Findings

01 Up to 2.3x higher effective throughput
02 2x lower average latency
03 Up to 3.7x lower tail latency

Abstract

The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user…
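
The abstract names Dynamic SplitFuse only as a "prompt and generation composition strategy". As commonly described for DeepSpeed-FastGen, the idea is to run every forward pass at a roughly fixed token budget: long prompts are split into chunks across several passes, and short prompts and ongoing generations are fused into the same pass to fill the budget. The toy scheduler below is a minimal sketch of that idea under stated assumptions. The names `Request`, `schedule`, and `token_budget` are hypothetical and are not part of the DeepSpeed-MII or DeepSpeed-Inference APIs, and the sketch ignores KV-cache management, request arrival times, and the fact that the pass finishing a prompt also emits the first generated token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_tokens: int    # prompt tokens not yet processed
    max_new_tokens: int   # tokens still to be generated


def schedule(requests, token_budget):
    """Toy SplitFuse-style scheduler (illustrative, not the real implementation).

    Returns a list of forward passes. Each pass is a list of
    (rid, kind, n_tokens) tuples whose token counts sum to at most
    token_budget.
    """
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")

    # Copy the requests so the caller's objects are not mutated.
    pending = deque(
        Request(r.rid, r.prompt_tokens, r.max_new_tokens) for r in requests
    )
    decoding = []
    passes = []

    while pending or decoding:
        batch = []
        budget = token_budget

        # 1) Every sequence already generating contributes one token.
        still_decoding = []
        for r in decoding:
            if budget == 0:
                still_decoding.append(r)
                continue
            batch.append((r.rid, "decode", 1))
            budget -= 1
            r.max_new_tokens -= 1
            if r.max_new_tokens > 0:
                still_decoding.append(r)
        decoding = still_decoding

        # 2) Fill the remaining budget with prompt chunks, splitting a long
        #    prompt across passes and fusing short prompts into one pass.
        while budget > 0 and pending:
            r = pending[0]
            take = min(budget, r.prompt_tokens)
            if take > 0:
                batch.append((r.rid, "prompt", take))
            r.prompt_tokens -= take
            budget -= take
            if r.prompt_tokens == 0:
                pending.popleft()
                if r.max_new_tokens > 0:
                    decoding.append(r)

        passes.append(batch)

    return passes
```

For example, `schedule([Request("A", 10, 2), Request("B", 3, 1)], token_budget=4)` splits A's 10-token prompt over three passes, fuses the tail of A's prompt with the start of B's prompt in the third pass, and mixes A's decode step with B's remaining prompt token in the fourth. No pass exceeds four tokens, which is the property the strategy relies on to keep per-pass latency consistent.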

Code & Models

Repositories

microsoft/deepspeed-mii
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices