PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

Dongjie Yang; XiaoDong Han; Yan Gao; Yao Hu; Shilin Zhang; Hai Zhao

arXiv:2405.12532·cs.CL·June 6, 2024·1 cites

TL;DR

PyramidInfer compresses the KV cache of large language models by retaining only the crucial context at each layer, which significantly reduces GPU memory usage and increases inference throughput without a loss in model performance.

Contribution

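The paper introduces a layer-wise KV cache compression technique that selects crucial keys and values by the consistency of their attention weights. It addresses two issues that earlier methods neglect: the dependency between layers and the memory consumed during pre-computation.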

Findings

01. 2.2x throughput improvement over baseline
02. 54% GPU memory reduction in KV cache
03. Maintains model performance with reduced memory usage
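
To put the memory figure in context, the size of an uncompressed KV cache can be estimated from the model shape. The sketch below is illustrative only: the function name, model dimensions, sequence length, and batch size are hypothetical values not taken from the paper, and it assumes the 54% reduction applies uniformly to the whole cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Estimate the size of a dense KV cache in bytes.

    The leading factor 2 accounts for storing one key tensor and one value
    tensor per layer. bytes_per_elem=2 corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape and workload, chosen only for illustration.
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                            seq_len=2048, batch_size=8)

# Assumption: a 54% reduction leaves 46% of the original cache.
reduced_bytes = full_bytes * 46 // 100

GIB = 1024 ** 3
print(f"full cache:    {full_bytes / GIB:.2f} GiB")
print(f"reduced cache: {reduced_bytes / GIB:.2f} GiB")
```

Because the cache grows linearly with both sequence length and batch size, any memory freed this way can in principle be spent on a larger batch, which is one plausible route from a memory reduction to a throughput gain.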

Abstract

Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference, hindering their scalability for real-time applications like chatbots. To accelerate inference, we store computed keys and values (KV cache) in GPU memory. Existing methods study KV cache compression, reducing memory by pruning the pre-computed KV cache. However, they neglect the dependency between layers and the huge memory consumption of pre-computation. Exploring these deficiencies, we find that the number of crucial keys and values that influence future generations decreases layer by layer, and that we can extract them by the consistency in attention weights. Based on these findings, we propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context. PyramidInfer saves significant memory by computing fewer…
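
The idea of retaining fewer keys and values in deeper layers can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear budget schedule, and the use of attention averaged over a few recent query tokens as a stand-in for "consistency in attention weights" are all hypothetical choices made for this example.

```python
def select_crucial_positions(recent_attn, keep):
    """Return indices of the `keep` positions with the highest average weight.

    recent_attn: list of attention rows, one per recent query token, each a
    list of weights over the same context positions (assumed input format).
    """
    n = len(recent_attn[0])
    avg = [sum(row[j] for row in recent_attn) / len(recent_attn)
           for j in range(n)]
    top = sorted(range(n), key=lambda j: avg[j], reverse=True)[:keep]
    return sorted(top)


def pyramid_budgets(context_len, num_layers, first_ratio, last_ratio):
    """Hypothetical schedule: retention ratio changes linearly with depth."""
    budgets = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        ratio = first_ratio + t * (last_ratio - first_ratio)
        budgets.append(max(1, round(context_len * ratio)))
    return budgets


def compress_layerwise(attn_per_layer, budgets):
    """Each layer selects only among positions kept by the previous layer."""
    kept = list(range(len(attn_per_layer[0][0])))
    retained = []
    for recent_attn, budget in zip(attn_per_layer, budgets):
        # Restrict every attention row to the surviving positions.
        sub = [[row[j] for j in kept] for row in recent_attn]
        local = select_crucial_positions(sub, min(budget, len(kept)))
        kept = [kept[i] for i in local]
        retained.append(kept)
    return retained


# Toy input: 2 layers, 2 recent query tokens, 4 context positions.
attn_per_layer = [
    [[0.4, 0.1, 0.3, 0.2], [0.5, 0.1, 0.2, 0.2]],  # layer 0
    [[0.1, 0.2, 0.6, 0.1], [0.2, 0.1, 0.5, 0.2]],  # layer 1
]
budgets = pyramid_budgets(context_len=4, num_layers=2,
                          first_ratio=0.75, last_ratio=0.25)
retained = compress_layerwise(attn_per_layer, budgets)
print(budgets)
print(retained)
```

Selecting each layer's positions only from those the previous layer kept is one simple way to respect the dependency between layers: the retained set can only shrink with depth, which gives the cache its pyramid shape.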

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies

Methods: Pruning