Cross-Self KV Cache Pruning for Efficient Vision-Language Inference

Xiaohuan Pei; Tao Huang; Chang Xu

arXiv:2412.04652·cs.CV·December 9, 2024

TL;DR

This paper introduces CSP, a novel method for pruning KV caches in vision-language models by decomposing attention into intra- and inter-modality components, improving efficiency without sacrificing performance.

Contribution

It proposes a training-free, decomposition-based KV cache pruning technique with an n-softmax function to better handle modality discrepancies in vision-language models.
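The distribution shift that the n-softmax is meant to counteract can be illustrated with a small sketch. The source does not give the n-softmax formula, so nothing below should be read as the paper's definition. `softmax_with_offset` is a hypothetical helper that adds a constant to the softmax denominator as a stand-in for the pruned tokens, and the offset is computed from the full scores purely for illustration.

```python
import math

def softmax(scores):
    """Standard softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_offset(scores, offset):
    """Hypothetical variant (an assumption, not the paper's n-softmax):
    the denominator carries an extra constant standing in for pruned tokens."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps) + offset
    return [e / total for e in exps]

# Toy logits for five cached tokens; only the first two survive pruning.
full_scores = [2.0, 1.0, 0.5, 0.0, -1.0]
keep = [0, 1]
kept_scores = [full_scores[i] for i in keep]

full = softmax(full_scores)    # weights before pruning
pruned = softmax(kept_scores)  # renormalised over the kept tokens only

# Oracle offset: the exponentiated mass of the pruned tokens (illustration only).
pruned_mass = sum(math.exp(s) for i, s in enumerate(full_scores) if i not in keep)
corrected = softmax_with_offset(kept_scores, pruned_mass)

print("full     :", [round(w, 3) for w in full[:2]])
print("pruned   :", [round(w, 3) for w in pruned])
print("corrected:", [round(w, 3) for w in corrected])
```

After pruning, the plain softmax renormalises over fewer tokens, so every kept token receives a larger weight than it had originally. With the offset set to the pruned mass, the kept weights match their pre-pruning values. In practice that mass would have to be approximated rather than read off the full cache.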

Findings

01 Achieves up to 41% performance improvement on challenging tasks.

02 Reduces KV cache memory by 13.6%.

03 Outperforms previous pruning methods.

Abstract

KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from large language models (LLMs) to identify and prune irrelevant tokens. However, these approaches overlook the inherent distributional discrepancies between modalities, often leading to inaccurate token importance estimation and the over-pruning of critical visual tokens. To address this, we propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an n-softmax function to counteract distribution shifts caused by pruning, preserving the original…
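The idea of managing intra- and inter-modality attention independently can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the queries are text tokens, so attention to text keys is intra-modality and attention to visual keys is inter-modality, and it assumes a separate top-k budget per group. The helper names, budgets, and toy attention values are all hypothetical.

```python
def key_scores(attn_rows):
    """Sum attention over the query rows to get one importance score per cached key."""
    return [sum(col) for col in zip(*attn_rows)]

def top_k(indices, scores, k):
    """Return the k indices (drawn from `indices`) with the largest scores."""
    return set(sorted(indices, key=lambda i: scores[i], reverse=True)[:k])

def prune_joint(scores, budget):
    """Baseline: rank all keys together, regardless of modality."""
    return top_k(range(len(scores)), scores, budget)

def prune_decomposed(scores, modality, budget_vis, budget_txt):
    """Assumed scheme: rank visual keys (inter-modality scores) and text keys
    (intra-modality scores) separately, each with its own budget."""
    vis = [i for i, m in enumerate(modality) if m == "vis"]
    txt = [i for i, m in enumerate(modality) if m == "txt"]
    return top_k(vis, scores, budget_vis) | top_k(txt, scores, budget_txt)

# Toy cache: three visual keys followed by three text keys.
modality = ["vis", "vis", "vis", "txt", "txt", "txt"]
# Attention from two text queries; visual keys get much smaller weights overall.
attn_rows = [
    [0.05, 0.02, 0.03, 0.40, 0.30, 0.20],
    [0.06, 0.01, 0.03, 0.35, 0.25, 0.30],
]
scores = key_scores(attn_rows)

joint = prune_joint(scores, budget=3)
decomposed = prune_decomposed(scores, modality, budget_vis=1, budget_txt=2)
print("joint keeps     :", sorted(joint))       # [3, 4, 5]
print("decomposed keeps:", sorted(decomposed))  # [0, 3, 4]
```

In this toy case the joint ranking spends the entire budget on text keys, because visual scores sit on a smaller scale. The per-modality ranking retains the most-attended visual key under the same total budget of three.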

Code & Models

Repositories

terrypei/csp (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Algorithms and Data Compression

Methods: Softmax · Attention Is All You Need · Pruning