A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini

TL;DR
This paper introduces a simple $L_2$ norm-based method for compressing the KV cache in large language models. It significantly reduces memory usage while maintaining accuracy, and it remains compatible with existing attention mechanisms.
Contribution
The paper reveals a correlation between the $L_2$ norms of key embeddings and attention scores (keys with a low norm tend to receive high attention) and proposes a KV cache compression strategy based on this insight.
Findings
Reduces KV cache size by 50% in language modeling tasks.
Achieves 90% reduction in passkey retrieval tasks.
Maintains accuracy without relying on attention scores.
Abstract
The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache…
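The abstract is cut off before it describes the compression step, so the following is only a minimal sketch of one way the stated observation could be used, not the authors' implementation. It assumes that the KV pairs worth keeping are those whose keys have the lowest $L_2$ norm, since the abstract associates a low key norm with a high attention score. The function names `l2_norm` and `compress_kv_cache` and the `keep_ratio` parameter are hypothetical, and keys and values are plain Python lists for a single attention head.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio):
    """Keep the fraction `keep_ratio` of KV pairs whose keys have the lowest L2 norm.

    Assumption (not confirmed by the truncated abstract): low-norm keys are
    retained and high-norm keys are evicted. The original token order of the
    surviving pairs is preserved.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    # Rank cache positions by key norm, smallest first.
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    # Restore positional order among the survivors.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four positions; the key norms are 5.0, about 0.22, 1.0 and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = [[0.0], [1.0], [2.0], [3.0]]
kept_keys, kept_values, kept_idx = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept_idx)  # positions of the two lowest-norm keys
```

Because the selection depends only on the keys themselves, a sketch like this never needs the attention matrix, which is consistent with the finding that accuracy is maintained without relying on attention scores.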
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need
