SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

Zicong Tang; Shi Luohe; Zuchao Li; Baoyuan Qi; Guoming Liu; Lefei Zhang; Ping Wang

arXiv:2507.06517 · cs.CL · July 10, 2025

TL;DR

SpindleKV is a method for reducing KV cache size in large language models. It balances reduction across shallow and deep layers by using attention-based eviction in deep layers and learned codebook replacement in shallow layers, with the aim of improving efficiency without sacrificing performance.

Contribution

The paper introduces SpindleKV, a KV cache reduction technique that balances shallow and deep layer reduction by combining attention-weight-based eviction with similarity-based codebook replacement.

Findings

01 Achieves better KV cache reduction compared to baseline methods.

02 Maintains or improves model performance after reduction.

03 Effective on multiple benchmarks and LLMs.

Abstract

Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. Based on our observation that the KV cache exhibits a high degree of similarity, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach that is learned by a similarity and merging policy. Moreover, SpindleKV addressed the Grouped-Query…
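The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`evict_by_attention`, `replace_with_codebook`), the accumulated-attention score, the cosine-similarity threshold, and the greedy codebook construction are all assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math


def evict_by_attention(keys, values, attn_weights, budget):
    """Keep the `budget` cached tokens that received the most attention.

    attn_weights[q][t] is the attention query q pays to cached token t.
    The scoring rule (a plain sum over queries) is an assumption.
    """
    num_tokens = len(keys)
    scores = [sum(row[t] for row in attn_weights) for t in range(num_tokens)]
    ranked = sorted(range(num_tokens), key=lambda t: scores[t], reverse=True)
    keep = sorted(ranked[:budget])  # restore original token order
    return [keys[t] for t in keep], [values[t] for t in keep], keep


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def replace_with_codebook(vectors, threshold):
    """Replace similar vectors with a shared codebook entry.

    Hypothetical greedy stand-in for a learned codebook: each vector maps to
    its most similar existing entry if the similarity reaches `threshold`;
    otherwise the vector itself becomes a new entry.
    """
    codebook, indices = [], []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, entry in enumerate(codebook):
            sim = cosine(v, entry)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best < 0:
            codebook.append(list(v))
            best = len(codebook) - 1
        indices.append(best)
    return codebook, indices
```

In this sketch a deep layer would call `evict_by_attention` to shrink its cache to a fixed token budget, while a shallow layer would store only the `codebook` plus one integer index per token. The memory saving in the second case depends on how many vectors fall above the similarity threshold.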

Code & Models

Repositories

tyxqc/spindlekv (PyTorch, official)

Videos

SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Advanced Neural Network Applications