DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He

TL;DR
DeepSpeed-FastGen is a system that significantly improves the throughput and latency of large language model serving by employing a novel prompt and generation composition strategy, making deployment more efficient and scalable.
Contribution
It introduces Dynamic SplitFuse, a new prompt and generation composition strategy, combined with DeepSpeed-MII and DeepSpeed-Inference, to enhance LLM serving performance.
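The summary above does not spell out how Dynamic SplitFuse composes a batch. The sketch below is a toy illustration of the general idea as it is usually described: each forward pass has a fixed token budget, requests that are already generating contribute one token each, and the rest of the budget is filled with prompt tokens, splitting a long prompt across several passes when it does not fit. The names `Request`, `schedule`, and `token_budget` are hypothetical and are not part of DeepSpeed-MII or DeepSpeed-Inference. As a simplification, a request starts generating on the pass after its prompt finishes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; not a DeepSpeed-FastGen type."""
    name: str
    prompt_tokens: int   # prompt tokens not yet processed
    decode_tokens: int   # tokens still to be generated


def schedule(requests, token_budget):
    """Return one list per forward pass of (name, phase, n_tokens) entries.

    Illustrative only. The function mutates the Request objects it is given.
    """
    if token_budget < 1:
        raise ValueError("token_budget must be at least 1")
    prefill = deque(requests)   # requests with unprocessed prompt tokens
    decoding = []               # requests whose prompt is fully processed
    passes = []
    while prefill or decoding:
        batch, budget = [], token_budget
        # Step 1: one generated token for each decoding request, budget permitting.
        still_decoding = []
        for req in decoding:
            if budget == 0:
                still_decoding.append(req)
                continue
            batch.append((req.name, "decode", 1))
            req.decode_tokens -= 1
            budget -= 1
            if req.decode_tokens > 0:
                still_decoding.append(req)
        decoding = still_decoding
        # Step 2: fill the remaining budget with prompt chunks, splitting
        # a long prompt across passes when it exceeds what is left.
        while prefill and budget > 0:
            req = prefill[0]
            chunk = min(req.prompt_tokens, budget)
            batch.append((req.name, "prompt", chunk))
            req.prompt_tokens -= chunk
            budget -= chunk
            if req.prompt_tokens == 0:
                prefill.popleft()
                if req.decode_tokens > 0:
                    decoding.append(req)
        passes.append(batch)
    return passes


if __name__ == "__main__":
    demo = [Request("long", 10, 2), Request("short", 3, 1)]
    for i, batch in enumerate(schedule(demo, token_budget=8)):
        print(i, batch)
```

In this toy run the 10-token prompt is split into chunks of 8 and 2, and the second chunk shares a pass with the short prompt, so no pass exceeds the 8-token budget. A production scheduler would also have to handle KV-cache capacity, request arrival over time, and fairness, none of which is modeled here.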
Findings
Up to 2.3x higher effective throughput than state-of-the-art systems such as vLLM
2x lower latency on average
Up to 3.7x lower token-level tail latency
Abstract
The deployment and scaling of large language models (LLMs) have become critical as they permeate various applications, demanding high-throughput and low-latency serving systems. Existing frameworks struggle to balance these requirements, especially for workloads with long prompts. This paper introduces DeepSpeed-FastGen, a system that employs Dynamic SplitFuse, a novel prompt and generation composition strategy, to deliver up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency, compared to state-of-the-art systems like vLLM. We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for LLMs. DeepSpeed-FastGen's advanced implementation supports a range of models and offers both non-persistent and persistent deployment options, catering to diverse user…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Ferroelectric and Negative Capacitance Devices
