A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

Alessio Devoto; Yu Zhao; Simone Scardapane; Pasquale Minervini

arXiv:2406.11430 · cs.CL · November 5, 2024

TL;DR

This paper introduces a simple $L_2$ norm-based method for compressing the KV cache in large language models. The method significantly reduces memory usage while maintaining accuracy, and it is compatible with existing attention mechanisms.

Contribution

The paper reveals a correlation between the $L_2$ norms of key embeddings and attention scores: keys with a low $L_2$ norm tend to receive high attention scores during decoding. It proposes a KV cache compression strategy based on this insight.

Findings

01. Reduces KV cache size by 50% in language modeling tasks.

02. Achieves a 90% reduction in passkey retrieval tasks.

03. Maintains accuracy without relying on attention scores.
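To put the reduction figures above in concrete terms, the following sketch estimates KV cache memory for a single sequence. The model configuration (32 layers, 32 heads, head dimension 128, 16-bit values, 4096 cached tokens) is a hypothetical example and is not taken from the paper; the function name `kv_cache_bytes` is likewise illustrative. It assumes standard multi-head attention with one key and one value vector per head, per layer, per token.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

full = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)
half = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN // 2)    # 50% of pairs kept
tenth = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN // 10)  # about 90% removed

GIB = 2**30
print(f"full cache:  {full / GIB:.2f} GiB")
print(f"50% kept:    {half / GIB:.2f} GiB")
print(f"~10% kept:   {tenth / GIB:.2f} GiB")
```

Because the cache size is linear in the number of cached tokens, dropping a given fraction of KV pairs reduces memory by the same fraction under these assumptions.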

Abstract

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache…
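The observation in the abstract suggests a selection rule that needs no attention scores: rank cached keys by their $L_2$ norm and keep the pairs with the lowest norms. The sketch below is a minimal illustration of that idea for a single attention head, not the authors' implementation. The function names, the `keep_ratio` parameter, and the choice to preserve original token order are all assumptions made for this example.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def compress_kv_cache(keys, values, keep_ratio):
    """Keep the KV pairs whose keys have the lowest L2 norm.

    keys, values: equal-length lists, one vector per cached token
                  (a single attention head is assumed).
    keep_ratio:   fraction of pairs to retain, in (0, 1] (hypothetical knob).
    Returns (kept_keys, kept_values, kept_indices), in original token order.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not keys:
        return [], [], []

    n_keep = max(1, int(len(keys) * keep_ratio))
    # Rank token positions by key norm, lowest first (ties keep the earlier token).
    ranked = sorted(range(len(keys)), key=lambda i: l2_norm(keys[i]))
    # Restore original order so positional structure is preserved.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four tokens; the key norms are 5.0, about 0.22, 1.0 and 10.0.
keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [6.0, 8.0]]
values = [[0.0], [1.0], [2.0], [3.0]]
kept_keys, kept_values, kept_idx = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept_idx)  # [1, 2]
```

Since the ranking depends only on the keys, it can be computed as soon as a key is cached, before any query attends to it.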

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need