Cross-Self KV Cache Pruning for Efficient Vision-Language Inference
Xiaohuan Pei, Tao Huang, Chang Xu

TL;DR
This paper introduces CSP, a novel method for pruning KV caches in vision-language models by decomposing attention into intra- and inter-modality components, improving efficiency without sacrificing performance.
Contribution
The paper proposes a training-free, decomposition-based KV cache pruning technique, together with an n-softmax function, to better handle modality discrepancies in vision-language models.
Findings
Achieves up to 41% performance improvement on challenging tasks.
Reduces KV cache memory by 13.6%.
Outperforms previous pruning methods.
Abstract
KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from large language models (LLMs) to identify and prune irrelevant tokens. However, these approaches overlook the inherent distributional discrepancies between modalities, often leading to inaccurate token importance estimation and the over-pruning of critical visual tokens. To address this, we propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an n-softmax function to counteract distribution shifts caused by pruning, preserving the original…
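The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a single attention head, ignores causal masking, and uses hypothetical names (`decomposed_key_scores`, `select_keys`, `k_intra`, `k_inter`) and a simple mean-then-top-k selection rule that the abstract does not specify. Each cached key receives two importance scores: one from queries of the same modality (intra) and one from queries of the other modality (inter). Keys are then selected separately under each score, and the union is kept. The n-softmax step is omitted because its definition is not given in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decomposed_key_scores(attn, q_is_visual, k_is_visual):
    """Split per-key importance into intra- and inter-modality parts.

    attn:        (Q, K) attention weights, rows are queries.
    q_is_visual: (Q,) bool, True where the query token is visual.
    k_is_visual: (K,) bool, True where the key token is visual.
    Returns two (K,) arrays: mean attention from same-modality queries
    and mean attention from other-modality queries.
    """
    same = q_is_visual[:, None] == k_is_visual[None, :]  # (Q, K) bool mask
    intra_sum = np.where(same, attn, 0.0).sum(axis=0)
    inter_sum = np.where(~same, attn, 0.0).sum(axis=0)
    # Guard against keys that have no query of a given type.
    intra_cnt = np.maximum(same.sum(axis=0), 1)
    inter_cnt = np.maximum((~same).sum(axis=0), 1)
    return intra_sum / intra_cnt, inter_sum / inter_cnt

def select_keys(intra, inter, k_intra, k_inter):
    # Assumed rule: take the top keys under each score independently,
    # then keep the union (returned sorted, without duplicates).
    top_intra = np.argsort(-intra)[:k_intra]
    top_inter = np.argsort(-inter)[:k_inter]
    return np.union1d(top_intra, top_inter)

# Toy example: 6 queries and 10 cached keys with random projections.
rng = np.random.default_rng(0)
Q, K, d = 6, 10, 8
q = rng.normal(size=(Q, d))
k = rng.normal(size=(K, d))
attn = softmax(q @ k.T / np.sqrt(d))
q_is_visual = np.array([True, True, True, False, False, False])
k_is_visual = np.array([True] * 6 + [False] * 4)

intra, inter = decomposed_key_scores(attn, q_is_visual, k_is_visual)
kept = select_keys(intra, inter, k_intra=3, k_inter=3)
print("kept key indices:", kept)
```

Scoring the two attention types separately means a visual key that matters mainly to text queries is not crowded out by keys that score highly only within their own modality, which is the failure mode the abstract attributes to pruning on a single pooled attention score.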
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Algorithms and Data Compression
Methods: Softmax · Attention Is All You Need · Pruning
