FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

TL;DR
FlexGen runs high-throughput large language model inference on a single GPU by pooling memory and computation from the GPU, CPU, and disk, searching for efficient tensor storage and access patterns, and compressing the weights and the attention cache to 4 bits.
Contribution
The paper presents FlexGen, a generation engine that delivers high-throughput LLM inference on limited hardware by solving a linear programming problem to choose how tensors are stored and accessed, and by compressing the weights and the attention cache to 4 bits.
Findings
Reaches a generation throughput of 1 token/s for OPT-175B on a single 16GB GPU.
Benchmarks a 30B model on HELM in 21 hours on a single 16GB GPU.
Significantly outperforms existing offloading systems in throughput.
Abstract
The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus…
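The 4-bit compression mentioned in the abstract can be illustrated with a minimal sketch. It assumes group-wise min-max quantization, where each group of values is mapped to integer codes in [0, 15] and two codes are packed per byte. The function names and the example group are hypothetical, and this is not FlexGen's actual implementation.

```python
from typing import List, Tuple

BITS = 4
LEVELS = (1 << BITS) - 1  # 15, the largest 4-bit code


def quantize_group(values: List[float]) -> Tuple[List[int], float, float]:
    """Map one group of floats to 4-bit codes; return (codes, minimum, scale)."""
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 when every value in the group is identical.
    scale = (hi - lo) / LEVELS or 1.0
    # Round to the nearest level and clamp to the valid 4-bit range.
    codes = [min(LEVELS, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_group(codes: List[int], lo: float, scale: float) -> List[float]:
    """Reconstruct approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]


def pack(codes: List[int]) -> bytes:
    """Pack two 4-bit codes per byte, padding with a zero code if needed."""
    padded = codes + [0] if len(codes) % 2 else codes
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))


def unpack(data: bytes, n: int) -> List[int]:
    """Unpack the first n 4-bit codes from packed bytes."""
    out: List[int] = []
    for b in data:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:n]


# Hypothetical group of weights, chosen only for illustration.
weights = [-1.0, -0.5, 0.0, 0.25, 0.5, 1.0, 0.1, -0.3]
codes, lo, scale = quantize_group(weights)
packed = pack(codes)
restored = dequantize_group(unpack(packed, len(weights)), lo, scale)
```

In this sketch each reconstructed value differs from the original by at most half a quantization step (`scale / 2`). The packed codes take half a byte per value, plus one minimum and one scale per group.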
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science
