Lossless KV Cache Compression to 2%

Zhen Yang; J.N. Han; Kan Wu; Ruobing Xie; An Wang; Xingwu Sun; Zhanhui Kang

arXiv:2410.15252·cs.CL·October 22, 2024

TL;DR

This paper presents Cross-Layer Latent Attention (CLLA), an architecture that compresses the KV cache of large language models to under 2% of its original size while maintaining comparable performance (lossless on most tasks), enabling more efficient inference.

Contribution

Introduces CLLA, a single framework that combines attention head/dimension reduction, layer sharing, and quantization to compress the KV cache of language models with little or no loss in performance.

Findings

01. Achieves lossless performance on most tasks.

02. Reduces the KV cache to less than 2% of its original size.

03. Improves inference efficiency by shrinking the memory the KV cache requires.

Abstract

Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in…
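
The abstract names three compression axes (head/dimension reduction, layer sharing, quantization) whose savings multiply. The sketch below works through that arithmetic for a made-up model; `CacheConfig`, `kv_cache_bytes_per_token`, and every number in it are illustrative assumptions, not the configuration or code used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    """Hypothetical description of what one model stores in its KV cache."""
    num_layers: int             # transformer layers in the model
    cached_dims_per_token: int  # values cached per token in each cached layer
    layers_per_cache: int = 1   # layers sharing one cache (1 = no sharing)
    bits_per_value: int = 16    # storage precision of each cached value


def kv_cache_bytes_per_token(cfg: CacheConfig) -> int:
    """Bytes of KV cache needed for a single token across the whole model."""
    if cfg.num_layers % cfg.layers_per_cache != 0:
        raise ValueError("num_layers must be divisible by layers_per_cache")
    cached_layers = cfg.num_layers // cfg.layers_per_cache
    total_bits = cached_layers * cfg.cached_dims_per_token * cfg.bits_per_value
    return total_bits // 8


# Illustrative baseline: 32 layers, 32 heads x 128 dims, keys and values, 16-bit.
baseline = CacheConfig(num_layers=32, cached_dims_per_token=2 * 32 * 128)

# Illustrative compressed setup (numbers are assumptions, not the paper's):
# a 512-dim latent per token, one cache per 2 layers, 4-bit storage.
compressed = CacheConfig(
    num_layers=32, cached_dims_per_token=512, layers_per_cache=2, bits_per_value=4
)

ratio = kv_cache_bytes_per_token(compressed) / kv_cache_bytes_per_token(baseline)
print(f"compressed cache is {ratio:.2%} of the baseline")
```

With these assumed numbers the three factors are 1/16 (dimension reduction), 1/2 (layer sharing), and 1/4 (quantization), which multiply to 1/128, or about 0.78% of the baseline. No single axis gets close to 2% on its own here, which is why combining them matters.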

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Algorithms and Data Compression · Embedded Systems Design Techniques

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings