RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Hanlin Tang; Yang Lin; Jing Lin; Qingsen Han; Shikuan Hong; Yiwu Yao; Gongyi Wang

arXiv:2407.15891·cs.LG·July 24, 2024

3 Reviews

TL;DR

RazorAttention is a novel, training-free KV cache compression method that selectively preserves crucial token information for key attention heads, significantly reducing cache size while maintaining model performance.

Contribution

It introduces a cache compression technique that applies separate caching strategies to retrieval and non-retrieval attention heads, together with a compensation mechanism, improving efficiency without retraining.

Findings

01

Reduces KV cache size by over 70%

02

Maintains performance across diverse large language models

03

Compatible with existing attention mechanisms like FlashAttention

Abstract

The memory and computational demands of Key-Value (KV) cache present significant challenges for deploying long-context language models. Previous approaches attempt to mitigate this issue by selectively dropping tokens, which irreversibly erases critical information that might be needed for future queries. In this paper, we propose a novel compression technique for KV cache that preserves all token information. Our investigation reveals that: i) Most attention heads primarily focus on the local context; ii) Only a few heads, denoted as retrieval heads, can essentially pay attention to all input tokens. These key observations motivate us to use separate caching strategies for attention heads. Therefore, we propose RazorAttention, a training-free KV cache compression algorithm, which maintains a full cache for these crucial retrieval heads and discards the remote tokens in non-retrieval…
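The head-wise caching idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `n_sink` and `n_local` parameters, and the choice to keep a few initial tokens alongside the local window are assumptions made for illustration, and the compensation mechanism mentioned under Contribution is omitted. The set of retrieval heads is taken as given.

```python
import numpy as np

def compress_kv_headwise(keys, values, retrieval_heads, n_sink=4, n_local=16):
    """Illustrative head-wise KV cache pruning (hypothetical helper).

    keys, values: arrays of shape (num_heads, seq_len, head_dim).
    retrieval_heads: set of head indices that keep their full cache.
    n_sink, n_local: assumed numbers of initial and most recent tokens
        kept for every non-retrieval head.
    Returns a list of (k, v) pairs, one per head; heads may differ in length.
    """
    num_heads, seq_len, _ = keys.shape
    compressed = []
    for h in range(num_heads):
        if h in retrieval_heads or seq_len <= n_sink + n_local:
            # Retrieval heads (and short sequences) keep every token.
            keep = np.arange(seq_len)
        else:
            # Non-retrieval heads drop the remote (middle) tokens.
            keep = np.concatenate(
                [np.arange(n_sink), np.arange(seq_len - n_local, seq_len)]
            )
        compressed.append((keys[h, keep], values[h, keep]))
    return compressed

def kept_fraction(compressed, seq_len):
    """Fraction of the original per-head token slots that remain cached."""
    kept = sum(k.shape[0] for k, _ in compressed)
    return kept / (len(compressed) * seq_len)
```

For example, with 4 heads, 100 tokens, one retrieval head, and the default `n_sink=4`, `n_local=16`, the retrieval head keeps 100 entries and each of the other three keeps 20, so 160 of 400 slots (40%) remain. The achievable reduction therefore depends on the context length and on how many heads are treated as retrieval heads.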

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 5

Strengths

1. The proposed method demonstrates decent performance, providing a significant improvement over many token-dropping baselines. 2. The evaluation is thorough in terms of model coverage. 3. As recognized by many recent works, being FlashAttention-compatible is a prerequisite for a practical long-context serving method, and this work fulfills this prerequisite.

Weaknesses

1. The baseline methods (H2O and StreamingLLM) are dated and do not reflect the current SOTA in KV cache compression. I understand if the authors do not intend to feature methods like KIVI or TOVA, as they belong to different schools of approaches, but comparisons with sparse inference techniques like SnapKV or MInference are necessary. The authors' discussion around L358 regarding SnapKV is faithful, as it corresponds to the findings of SharedContextBench (also submitted to ICLR here)

Reviewer 02 · Rating 5 · Confidence 4

Strengths

The paper observes the phenomenon of "retrieval heads", in line with observations from other works. Based on the retrieval heads, the paper presents a straightforward token-dropping KV compression mechanism. Experimental results show improvement compared with previous token-dropping baselines.

Weaknesses

Lack of experimental results on compression ratio: The paper compares against StreamingLLM and H2O. However, StreamingLLM has a very limited compression ratio, and the performance of H2O also varies with the compression ratio. The current manuscript lacks a study of how different compression ratios in RazorAttention affect performance. Lack of overhead evaluation on efficiency: The paper claims that RazorAttention can enhance LLM inference efficiency without overhead. However,

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. The observation is neat, and the experiments support the hypothesis quite well. 2. The recipe to identify the heads is very lightweight (although is this a contribution, or is it previously known from the Anthropic paper?)

Weaknesses

1. Experiments: A. Why are some datasets from LongBench missing, like passage_retrieval etc.? B. Can you perform experiments on infinity benchmark for comparison? C. What is the exact KV compression in the LongBench datasets? The 70% number is misleading since LongBench generally has small context lengths and the authors have used a context threshold of 4000. Please add exact KV cache compression numbers for all the methods to the table. 2. Novelty: Can you please elaborate on the novel

Taxonomy

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · Focus