LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

Yike Zhang; Zhiyuan He; Huiqiang Jiang; Chengruidong Zhang; Yuqing Yang; Jianyong Wang; Lili Qiu

arXiv:2508.02215 · cs.LG · August 5, 2025

TL;DR

LeanK is a learnable method for pruning key cache channels in large language models. It reduces K cache memory by up to 70% and speeds up attention computation by 1.3x while maintaining accuracy.

Contribution

LeanK introduces a two-stage training process that learns static channel-wise masks for pruning the key cache, improving efficiency without sacrificing accuracy.

Findings

01. Up to 70% reduction in K cache memory
02. 16%-18% reduction in V cache memory
03. 1.3x speedup in attention computation

Abstract

Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specified sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
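To make the idea of a static channel-wise mask concrete, here is a minimal sketch in plain Python. It is not the LeanK implementation: the abstract does not say how importance scores are learned, so the scores are taken as given. The top-k selection, the rounding rule for hardware alignment, and all function and parameter names (`build_static_mask`, `keep_ratio`, `align`) are assumptions made for illustration only.

```python
# Illustrative sketch of static K cache channel pruning.
# All names and the top-k selection rule are hypothetical, not taken from LeanK.

def build_static_mask(importance, keep_ratio, align):
    """Return sorted indices of the channels to keep.

    Assumption: the kept-channel count is rounded up to a multiple of
    `align`, as a stand-in for a hardware alignment requirement.
    """
    head_dim = len(importance)
    keep = max(1, round(head_dim * keep_ratio))
    keep = min(head_dim, -(-keep // align) * align)  # ceiling to a multiple of align
    ranked = sorted(range(head_dim), key=lambda c: importance[c], reverse=True)
    return sorted(ranked[:keep])

def prune_keys(keys, kept):
    """Store only the kept channels of every cached key vector."""
    return [[row[c] for c in kept] for row in keys]

def attention_scores(query, pruned_keys, kept):
    """Dot products between the query (sliced to kept channels) and pruned keys."""
    q = [query[c] for c in kept]
    return [sum(a * b for a, b in zip(q, k)) for k in pruned_keys]

# Toy example: one head with 8 channels and 2 cached tokens.
importance = [0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01]  # assumed given
kept = build_static_mask(importance, keep_ratio=0.3, align=4)

keys = [[1, 2, 3, 4, 5, 6, 7, 8],
        [8, 7, 6, 5, 4, 3, 2, 1]]
pruned = prune_keys(keys, kept)

query = [1, 0, 1, 0, 1, 0, 1, 0]
scores = attention_scores(query, pruned, kept)
memory_fraction = len(kept) / len(importance)  # share of K cache still stored
```

Because the mask is static, the same channel indices apply to every token, so the pruned cache can be stored densely and the query only needs to be sliced once per decoding step. In this toy case the requested 30% keep ratio is rounded up to 4 of 8 channels by the alignment rule.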

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Neural Network Applications · Big Data and Digital Economy