SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers
Zicong Tang, Shi Luohe, Zuchao Li, Baoyuan Qi, Guoming Liu, Lefei Zhang, Ping Wang

TL;DR
SpindleKV is a KV cache reduction method for large language models that treats shallow and deep layers differently: deep layers use attention-weight-based eviction, while shallow layers use a learned codebook-based replacement. The aim is to shrink the cache in both kinds of layers while preserving model performance.
Contribution
The paper introduces SpindleKV, a novel KV cache reduction technique that effectively balances shallow and deep layer reduction using attention and similarity-based methods.
Findings
Achieves better KV cache reduction compared to baseline methods.
Maintains or improves model performance after reduction.
Effective on multiple benchmarks and LLMs.
Abstract
Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. We observe that the KV cache exhibits a high degree of similarity. Based on this observation, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach that is learned through a similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query…
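The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that eviction ranks cached tokens by attention accumulated over a few recent queries, and that the codebook is already given; the abstract says the codebook is learned by a similarity and merging policy, which is not reproduced here. The function names `evict_by_attention` and `replace_with_codebook` are hypothetical.

```python
import math


def evict_by_attention(keys, values, attn_weights, keep):
    """Keep the `keep` cached tokens with the highest accumulated attention.

    attn_weights[q][t] is the attention that query q pays to cached token t.
    Summing over queries as the importance score is an assumption of this sketch.
    """
    num_tokens = len(keys)
    scores = [sum(row[t] for row in attn_weights) for t in range(num_tokens)]
    top = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)[:keep]
    kept = sorted(top)  # restore the original token order
    return [keys[t] for t in kept], [values[t] for t in kept], kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def replace_with_codebook(vectors, codebook):
    """Map each vector to the index of its most similar codebook entry.

    The codebook is assumed to be given; how it is learned is outside this sketch.
    """
    return [
        max(range(len(codebook)), key=lambda c: cosine(v, codebook[c]))
        for v in vectors
    ]


# Toy cache: four tokens with 2-dimensional keys and 1-dimensional values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
values = [[1.0], [2.0], [3.0], [4.0]]
attn = [[0.1, 0.6, 0.2, 0.1], [0.05, 0.5, 0.05, 0.4]]

kept_keys, kept_values, kept_idx = evict_by_attention(keys, values, attn, keep=2)
codes = replace_with_codebook([[0.9, 0.1], [0.2, 0.8]], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy data, eviction drops the tokens that received the least attention, while codebook replacement stores a small integer index per vector in place of the full vector.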
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Advanced Neural Network Applications
