Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Bin Lin; Chen Zhang; Tao Peng; Hanyu Zhao; Wencong Xiao; Minmin Sun; Anmin Liu; Zhipeng Zhang; Lanbo Li; Xiafei Qiu; Shen Li; Zhigang Ji; Tao Xie; Yong Li; Wei Lin

arXiv:2401.02669 · cs.DC · July 8, 2024 · 6 cites

Open Access

TL;DR

Infinite-LLM is an LLM serving system that disaggregates attention layers from the rest of inference and pools GPU memory to serve extremely long contexts, improving throughput by 1.35-3.4x over state-of-the-art methods and improving resource utilization.

Contribution

The paper presents a novel LLM serving system that handles dynamic context lengths by disaggregating attention layers and pooling GPU memory, enabling scalable and efficient inference.
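This summary does not describe how attention is computed once the KV cache is spread over pooled memory. One standard way to make that possible, shown here as a minimal sketch and not as the paper's implementation, is to compute attention separately over each KV block and then merge the partial results with a softmax rescaling step. The function names (`partial_attention`, `merge_partials`, `distributed_attention`) and the `block_size` parameter are hypothetical and chosen for illustration only.

```python
import math

def partial_attention(q, keys, values):
    """Attention of one query vector over a single KV block.

    Returns (block_max, exp_sum, weighted), where `weighted` is the
    un-normalized sum of exp(score - block_max) * value.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    block_max = max(scores)
    weights = [math.exp(s - block_max) for s in scores]
    dim = len(values[0])
    weighted = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return block_max, sum(weights), weighted

def merge_partials(partials):
    """Combine per-block results into the exact full-softmax output."""
    global_max = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total = 0.0
    out = [0.0] * dim
    for block_max, exp_sum, weighted in partials:
        factor = math.exp(block_max - global_max)  # rescale to the global max
        total += factor * exp_sum
        for d in range(dim):
            out[d] += factor * weighted[d]
    return [o / total for o in out]

def full_attention(q, keys, values):
    """Reference: attention over the whole KV cache in one piece."""
    return merge_partials([partial_attention(q, keys, values)])

def distributed_attention(q, keys, values, block_size):
    """Split the KV cache into blocks (hypothetically one per device),
    compute partial attention per block, then merge."""
    partials = [
        partial_attention(q, keys[i:i + block_size], values[i:i + block_size])
        for i in range(0, len(keys), block_size)
    ]
    return merge_partials(partials)
```

Because each block only has to return a maximum, a sum, and one weighted vector, the data exchanged between devices is small compared with the KV blocks themselves, which is what makes placing blocks on different GPUs attractive in the first place.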

Findings

01. Achieves 1.35-3.4x throughput improvement over state-of-the-art methods.

02. Supports context lengths up to 2000K tokens across 32 GPUs.

03. Demonstrates effective resource management for dynamic attention behavior.
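To see why a 2000K-token context calls for memory pooled across many GPUs, a rough KV cache size estimate helps. The sketch below uses a hypothetical model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) and an assumed 80 GiB of memory per GPU; none of these figures come from this page.

```python
import math

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to hold keys and values for `tokens` tokens."""
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

GIB = 1024 ** 3

# Hypothetical model configuration (an assumption, not taken from the paper).
cache = kv_cache_bytes(tokens=2_000_000, layers=32, kv_heads=32, head_dim=128)
cache_gib = cache / GIB

# Assumed per-GPU memory; ignores model weights and activations.
gpu_gib = 80
gpus_needed = math.ceil(cache_gib / gpu_gib)

print(f"KV cache: {cache_gib:.2f} GiB, at least {gpus_needed} GPUs for the cache alone")
```

Under these assumptions a single request's KV cache is far larger than one GPU's memory, so the cache has to be spread over several devices before weights and activations are even counted.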

Abstract

Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly dynamic behavior of the attention layers, showcasing significant differences in computational characteristics and memory requirements from the non-attention layers. This presents substantial challenges for resource management and performance optimization in service systems. Existing static model parallelism and resource allocation strategies fall short when dealing with this dynamicity. To address the issue, we propose Infinite-LLM, a novel LLM serving system designed to effectively handle dynamic context lengths. Infinite-LLM disaggregates attention layers from an LLM's inference process, facilitating flexible and independent resource scheduling that…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques