SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao; Yixing Xu; Shizhu He; Guanchen Li; Xuanwu Yin; Dong Li; Emad Barsoum; Jun Zhao; Kang Liu

arXiv:2508.15212 · cs.CL · November 13, 2025

TL;DR

SparK introduces a channel-level unstructured sparsity technique for the KV cache in large language models. It dynamically prunes and restores cache entries to improve efficiency and to process longer sequences without significant accuracy loss.

Contribution

The paper proposes a training-free, plug-and-play channel pruning method for KV caches that allows longer sequences to be processed and integrates with existing compression techniques.

Findings

01 Reduces KV cache storage by over 30%

02 Maintains performance at an 80% pruning ratio

03 Enables processing of longer sequences within the same memory budget
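To make the link between the first and third findings concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from the paper: the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values), the 8 GiB budget, and the function names `kv_cache_bytes` and `max_tokens` are all assumptions chosen for illustration. It also assumes the storage reduction applies uniformly to every token and ignores any bookkeeping overhead that unstructured sparsity may need.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Dense KV cache size: one K and one V tensor per layer, linear in seq_len."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_tokens(budget_bytes, storage_reduction=0.0):
    """Tokens that fit in budget_bytes if each token's cache entry shrinks by storage_reduction."""
    per_token = kv_cache_bytes(1) * (1.0 - storage_reduction)
    return int(budget_bytes // per_token)


budget = 8 * 2**30  # hypothetical 8 GiB reserved for the KV cache
dense_tokens = max_tokens(budget)
pruned_tokens = max_tokens(budget, storage_reduction=0.30)  # "over 30%" taken as exactly 30%
print(dense_tokens, pruned_tokens, round(pruned_tokens / dense_tokens, 2))
```

Under these assumptions, a 30% storage reduction lets roughly 1 / (1 - 0.3), or about 1.43 times as many tokens, fit in the same budget.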

Abstract

Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method…
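The observation about channel saliency can be illustrated with a small Python sketch. An attention logit is a dot product, so it decomposes into one contribution per feature channel, and for a given query some of those contributions can be close to zero. The code below is only an illustration of that decomposition, not the paper's algorithm: the function names, the toy vectors, and the top-k selection rule are assumptions, and the recovery of pruned entries mentioned in the title is not modeled.

```python
def channel_contributions(q, k):
    """Per-channel terms of the dot product q . k (one term per feature channel)."""
    return [qc * kc for qc, kc in zip(q, k)]


def pruned_logit(q, k, keep_ratio):
    """Approximate q . k using only the channels with the largest |q_c * k_c|."""
    contrib = channel_contributions(q, k)
    n_keep = max(1, round(len(contrib) * keep_ratio))
    ranked = sorted(range(len(contrib)), key=lambda c: abs(contrib[c]), reverse=True)
    kept = sorted(ranked[:n_keep])
    return sum(contrib[c] for c in kept), kept


# Toy vectors (hypothetical): channels 1 and 3 matter little for this query.
q = [2.0, 0.01, -1.5, 0.02]
k = [1.0, 0.5, -2.0, 0.3]

full = sum(channel_contributions(q, k))
approx, kept = pruned_logit(q, k, keep_ratio=0.5)
print(f"full={full:.3f} approx={approx:.3f} kept_channels={kept}")
```

With these toy values, dropping half of the channels changes the logit only slightly, because the dropped channels contribute almost nothing for this particular query. A different query could make a different set of channels salient, which is why a query-aware criterion is of interest.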

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Topic Modeling