ThinK: Thinner Key Cache by Query-Driven Pruning

Yuhui Xu; Zhanming Jie; Hanze Dong; Lei Wang; Xudong Lu; Aojun Zhou; Amrita Saha; Caiming Xiong; Doyen Sahoo

arXiv:2407.21018·cs.CL·February 28, 2025

Open Access · 3 Reviews

TL;DR

ThinK introduces a query-dependent pruning method for KV caches in LLMs, significantly reducing memory usage while maintaining accuracy, enabling more efficient long-sequence processing.

Contribution

The paper presents a novel pruning technique that reduces KV cache memory in LLMs by over 20% without sacrificing performance, addressing redundancy in the channel dimension.

Findings

01

Over 20% reduction in KV cache memory costs.

02

Achieves 2.8x peak memory reduction with similar accuracy.

03

Enables up to 5x batch size increase on a single GPU.
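A rough back-of-the-envelope check on the memory figures above, assuming that only the key cache is pruned, that keys and values occupy equal memory before pruning, and that bookkeeping overhead (for example, storing which channels were kept) is ignored. The function name and the `key_share` parameter are hypothetical, and the 0.4 ratio is one of the pruning ratios a reviewer mentions, not a setting confirmed by this summary.

```python
from fractions import Fraction


def kv_cache_reduction(key_prune_ratio, key_share=Fraction(1, 2)):
    """Fraction of total KV cache memory removed when only keys are pruned.

    key_prune_ratio: fraction of key channels dropped.
    key_share: assumed share of KV memory held by keys (1/2 if K and V match).
    """
    return Fraction(key_prune_ratio) * key_share


# Dropping 40% of key channels removes 40% * 1/2 = 1/5 of the whole KV cache.
reduction = kv_cache_reduction(Fraction(2, 5))
print(reduction)  # 1/5
```

Under these assumptions, key-only pruning can save at most half of the KV cache, which is why the headline saving is about half of the key pruning ratio.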

Abstract

Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or…
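The idea of query-dependent channel pruning described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single attention head, scores each channel by the Frobenius norm of that channel's contribution to the query-key product (which equals the product of the two column norms), and keeps the highest-scoring channels. The function names and the toy data are hypothetical.

```python
import math


def channel_scores(queries, keys):
    """Score each channel c by ||Q[:, c]||_2 * ||K[:, c]||_2.

    This product equals the Frobenius norm of the outer product of the
    query column and the key column, i.e. the size of channel c's
    contribution to Q K^T. (Assumed criterion, for illustration.)
    """
    num_channels = len(keys[0])
    scores = []
    for c in range(num_channels):
        q_norm = math.sqrt(sum(row[c] ** 2 for row in queries))
        k_norm = math.sqrt(sum(row[c] ** 2 for row in keys))
        scores.append(q_norm * k_norm)
    return scores


def prune_key_channels(queries, keys, prune_ratio):
    """Drop the lowest-scoring key channels for one head.

    Returns the kept channel indices (ascending) and the thinner key cache.
    """
    num_channels = len(keys[0])
    num_keep = num_channels - int(num_channels * prune_ratio)
    scores = channel_scores(queries, keys)
    ranked = sorted(range(num_channels), key=lambda c: scores[c], reverse=True)
    kept = sorted(ranked[:num_keep])
    pruned_keys = [[row[c] for c in kept] for row in keys]
    return kept, pruned_keys


# Toy data: 2 query tokens and 3 cached key tokens, 4 channels each.
queries = [[1.0, 0.1, 2.0, 0.0],
           [0.5, 0.1, 1.0, 0.1]]
keys = [[3.0, 0.2, 0.1, 0.0],
        [2.0, 0.1, 0.2, 0.1],
        [1.0, 0.3, 0.1, 0.0]]

kept, pruned_keys = prune_key_channels(queries, keys, prune_ratio=0.5)
print(kept)  # [0, 2]
```

Note that channel 2 survives even though its key magnitudes are small, because the queries are large in that channel; a score based on keys alone would rank it differently. When attention is later computed against the thinner key cache, the incoming query would need to be restricted to the same kept channels.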

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 5 · Confidence 4

Strengths

The proposed method is simple and straightforward, applying direct structured pruning to the key channels. The evaluation shows that the pruned models achieve results comparable to the non-pruned baselines.

Weaknesses

Lacks experiments that apply the method directly to vanilla models. Why is the proposed method built on top of an already pruned model instead of directly on the vanilla model? The observations stem from the presence of large outliers along the channel dimension. This is not an attribute unique to compressed models (e.g., H2O). Yet experimenting on a compressed model introduces confounding variables that affect the analysis. The current manuscript lacks experiments that directly apply the method on the

Reviewer 02 · Rating 8 · Confidence 4

Strengths

`+` Channel-wise sparsity for K cache isn't a new thing, but this is (probably) one of the first papers that present a solid method for channel-wise K cache pruning.
`+` Compatibility with existing token eviction or cache compression methods.
`+` Solid and extensive evaluation. I am convinced by tables 2 and 3 that ThinK is good at detecting and pruning insignificant channels in the K cache, since the accuracy drop is very low even for a pruning ratio of 0.4 or 0.5, so this is a reasonably d

Weaknesses

`-` Unfortunately, the V cache typically demonstrates token-wise sparsity, so this method cannot be generalized to the V cache. To some people, this might seem like a smart engineering hack for a specific type of tensor with known distribution properties. **(Clear: The reviewer is clear with the presentation of section #2 after the rebuttal period)**
`-` The presentation of evaluation results and the choice of evaluation metrics are very concerning to me. There is an overwhelmingly large focus on accur

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- Timely problem
- Novelty
- Solid experimental results

Weaknesses

- Comparison with related work

Taxonomy

Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems

Methods: Softmax · Attention Is All You Need · LLaMA · Pruning