GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Hao Kang; Qingru Zhang; Souvik Kundu; Geonhwa Jeong; Zaoxing Liu; Tushar Krishna; Tuo Zhao

arXiv:2403.05527·cs.LG·October 2, 2024·1 cites

TL;DR

GEAR is a KV cache compression framework for LLM inference that combines quantization, low-rank approximation, and sparsity to achieve near-lossless 4-bit compression, with up to 2.38x higher throughput and up to 2.29x lower peak memory.

Contribution

GEAR introduces a new compression method that combines quantization, low-rank approximation, and sparsity to achieve high-ratio near-lossless KV cache compression for LLMs.

Findings

01. Achieves near-lossless 4-bit compression of the KV cache.

02. Improves throughput by up to 2.38x.

03. Reduces peak memory size by up to 2.29x.

Abstract

Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the cache grows with sequence length, which has turned LLM inference into a memory-bound problem and significantly constrains system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to the majority of entries of similar magnitudes to ultra-low precision. It then…
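The combination of quantization, a low-rank term, and a sparse term named in the summary above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`quantize_uniform`, `compress_sketch`), the per-tensor min-max quantizer, the choice to pull out the largest-magnitude entries as the sparse part, the truncated SVD of the quantization residual, and all parameter values (`outlier_frac`, `rank`) are assumptions made only for illustration.

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Per-tensor min-max uniform quantization (assumed scheme, bits <= 8).
    Returns the dequantized array so errors can be measured directly."""
    lo, hi = x.min(), x.max()
    levels = 2**bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)  # integer codes 0..levels
    return codes * scale + lo

def compress_sketch(x, bits=4, outlier_frac=0.02, rank=4):
    """Hypothetical three-part decomposition: x ~ quant + low_rank + sparse."""
    # Sparse part: keep the largest-magnitude entries in full precision.
    k = max(1, int(outlier_frac * x.size))
    idx = np.argpartition(np.abs(x).ravel(), -k)[-k:]
    sparse = np.zeros_like(x)
    sparse.flat[idx] = x.flat[idx]

    # Quantized part: the remaining entries have a much narrower range.
    dense = x - sparse
    quant = quantize_uniform(dense, bits)

    # Low-rank part: truncated SVD of the quantization residual.
    residual = dense - quant
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return quant, low_rank, sparse

# Toy matrix standing in for a KV cache slice, with a few injected outliers.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
x[rng.integers(0, 64, size=20), rng.integers(0, 32, size=20)] = 50.0

quant, low_rank, sparse = compress_sketch(x)
err_plain = np.linalg.norm(x - quantize_uniform(x))          # uniform quantization only
err_combined = np.linalg.norm(x - (quant + low_rank + sparse))
print(f"plain 4-bit error: {err_plain:.3f}, combined error: {err_combined:.3f}")
```

On this toy input the outliers stretch the quantization range when everything is quantized uniformly, which is the kind of approximation error the abstract attributes to uniform schemes. Removing them first and correcting the residual with a low-rank term reduces the Frobenius-norm error, at the cost of storing the extra sparse and low-rank factors.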

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Storage Technologies · Network Packet Processing and Optimization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings