ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo

TL;DR
ThinK introduces a query-dependent pruning method for KV caches in LLMs, significantly reducing memory usage while maintaining accuracy, enabling more efficient long-sequence processing.
Contribution
The paper presents a novel pruning technique that reduces KV cache memory in LLMs by over 20% without sacrificing performance, addressing redundancy in the channel dimension.
Findings
Over 20% reduction in KV cache memory costs.
Achieves 2.8x peak memory reduction with similar accuracy.
Enables up to 5x batch size increase on a single GPU.
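One way to read the first figure above is as simple arithmetic. The sketch below assumes that keys and values occupy equal shares of the KV cache and that only key channels are pruned; the function name and the 0.4 ratio (a pruning ratio mentioned in the reviews) are illustrative choices, not the authors' accounting.

```python
def kv_cache_fraction_saved(key_prune_ratio, value_prune_ratio=0.0):
    """Fraction of total KV cache memory removed by channel pruning.

    Assumption (not from the paper): K and V tensors have the same size,
    so each contributes half of the cache footprint.
    """
    return 0.5 * key_prune_ratio + 0.5 * value_prune_ratio


# Pruning 40% of key channels while leaving values untouched
saved = kv_cache_fraction_saved(0.4)
print(f"KV cache saved: {saved:.0%}")
```

Under these assumptions, pruning keys alone can remove at most half of the cache, which is why the key pruning ratio is roughly twice the overall saving.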
Abstract
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or…
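The idea of query-dependent channel pruning can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The scoring rule (the Frobenius norm of each channel's rank-one contribution to the query-key product), the function names, and the `keep_ratio` parameter are assumptions chosen for illustration.

```python
import numpy as np


def prune_key_channels(q_obs, k_cache, keep_ratio=0.6):
    """Keep the key channels that contribute most to Q K^T.

    q_obs:   (n_queries, head_dim) recent queries (hypothetical observation window)
    k_cache: (n_tokens, head_dim) cached keys for one attention head
    Returns the pruned keys and the sorted indices of the kept channels.
    """
    head_dim = k_cache.shape[1]
    n_keep = max(1, int(round(head_dim * keep_ratio)))
    # Channel i contributes the rank-one matrix q_obs[:, i] k_cache[:, i]^T,
    # whose Frobenius norm is ||q_obs[:, i]|| * ||k_cache[:, i]||.
    scores = np.linalg.norm(q_obs, axis=0) * np.linalg.norm(k_cache, axis=0)
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return k_cache[:, keep], keep


def attention_logits(q, k_pruned, keep, head_dim):
    """Scaled dot-product logits using only the kept channels of the query."""
    return q[:, keep] @ k_pruned.T / np.sqrt(head_dim)


# Toy data: channels 1 and 5 of the keys are made nearly redundant on purpose.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
k[:, [1, 5]] *= 1e-3

k_pruned, keep = prune_key_channels(q, k, keep_ratio=0.75)
full = q @ k.T / np.sqrt(8)
approx = attention_logits(q, k_pruned, keep, head_dim=8)
print("kept channels:", keep.tolist())
print("max logit error:", float(np.abs(full - approx).max()))
```

Because the score depends on the queries as well as the keys, a channel with large key magnitude can still be dropped if the queries barely use it. In this sketch the pruned cache stores fewer columns per token, and the query is sliced to the same channels at attention time.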
Peer Reviews
Decision·ICLR 2025 Spotlight
The proposed method is simple and straightforward, applying structured pruning directly to the key channels. The evaluation shows that the pruned models achieve results comparable to the non-pruned baselines.
Lack of experiments that apply the method directly to vanilla models. Why is the proposed method based on the pruned method instead of being applied directly to the vanilla model? The observations come from the fact that there are large outliers along the channel dimension. This is not a unique attribute of the compressed model (i.e., H2O). Yet experimenting on the compressed model introduces confounding variables that affect the analysis. The current manuscript lacks experiments that directly apply the method on the
`+` Channel-wise sparsity for K cache isn't a new thing, but this is (probably) one of the first papers that present a solid method for channel-wise K cache pruning.
`+` Compatibility with existing token eviction or cache compression methods.
`+` Solid and extensive evaluation. I am convinced by tables 2 and 3 that ThinK is good at detecting and pruning insignificant channels in the K cache, since the accuracy drop is very low even for a pruning ratio of 0.4 or 0.5, so this is a reasonably d
`-` Unfortunately, the V cache typically demonstrates token-wise sparsity, so this method cannot be generalized to the V cache. To some people, this might seem like a smart engineering hack for a specific type of tensor with known distribution properties. **(Clear: The reviewer is clear with the presentation of section #2 after the rebuttal period)**
`-` The presentation of evaluation results and the choice of evaluation metrics are very concerning to me. There is an overwhelmingly large focus on accur
- Timely problem
- Novelty
- Solid experimental results
- Comparison with related work
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
Methods: Softmax · Attention Is All You Need · LLaMA · Pruning
