PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models

Zhonghua Jiang; Kunxi Li; Yiyun Zhou; Sihao Liu; Zhaode Wang; Chengfei lv; Shengyu Zhang

arXiv:2510.25600·cs.MM·October 31, 2025

3 Reviews

TL;DR

PureKV is a plug-and-play framework that pairs KV cache compression with sparse attention in vision-language models, improving efficiency on high-resolution inputs with negligible quality degradation.

Contribution

The paper presents a KV cache compression strategy that stays compatible with efficient attention mechanisms such as FlashAttention, combined with Spatial-Temporal Sparse Attention (ST-SpAttn) tailored for efficient video VLLMs.

Findings

01 Achieves 5.0x KV cache compression

02 Attains 3.16x prefill acceleration

03 Maintains negligible quality degradation

Abstract

Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention and autoregressive generation, together with the constantly growing key-value (KV) cache, severely hinders the prefilling and decoding stages. Recent efforts have attempted to compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address…
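To make the incompatibility described in the abstract concrete, the following is a minimal sketch of attention-score-based cache pruning in general, not the method of any particular paper. The function names (`attention_matrix`, `prune_by_attention`), the toy vectors, and the column-sum importance rule are all assumptions chosen for illustration. The point is that the importance score is read off an explicitly materialized attention matrix, which fused kernels such as FlashAttention do not return.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(queries, keys):
    # Explicit causal attention matrix: row i attends to keys 0..i.
    # This is exactly the object that fused attention kernels avoid building.
    d = len(keys[0])
    attn = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:i + 1]]
        probs = softmax(logits)
        attn.append(probs + [0.0] * (len(keys) - i - 1))
    return attn

def prune_by_attention(attn, budget):
    # Hypothetical importance rule: total attention each token receives
    # (column sum). Keep the `budget` highest-scoring tokens, in order.
    n = len(attn[0])
    importance = [sum(row[j] for row in attn) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:budget]), importance

# Toy example: 10 cached tokens, keep 2 of them.
keys = [[float(i % 3), float(i % 2)] for i in range(10)]
queries = keys
attn = attention_matrix(queries, keys)
keep, importance = prune_by_attention(attn, budget=2)
compression = len(keys) / len(keep)  # cache entries before / after pruning
```

Keeping 2 of 10 entries corresponds to a 5.0x compression ratio in this toy setting; the ratio says nothing about which tokens a real model can afford to drop.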

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 5

Strengths

1. The cross-layer importance estimation using lower-layer attention scores and value vector norms is both novel and empirically validated.
2. The introduction of ST-SpAttn for video tasks is well-motivated and demonstrates clear benefits in both cache purification and acceleration.
3. The experiments cover multiple VLLMs, tasks, and cache budgets, showing consistent improvements.

Weaknesses

1. Figure 1(c) could be further improved to more clearly illustrate the differences between dense attention and ST-SpAttn.
2. While the empirical results are strong, the theoretical justification for why lower-layer attention scores are sufficient for high-layer importance estimation could be further strengthened, for example by providing more formal analysis or theoretical bounds.
3. It would be beneficial to include an analysis of PureKV’s computational overhead in the ablation study.
4. H2O

Reviewer 02·Rating 6·Confidence 3

Strengths

- The idea makes sense, combining two modules together for video KV cache compression with theoretically grounded cross-layer correlation analysis.
- The results are convincing, although more diverse video understanding tasks would strengthen the evaluation.
- The method achieves genuine compatibility with modern attention accelerators.

Weaknesses

- The core techniques (attention-based importance scoring and sparse attention) are not novel individually; more critically, the selection of which layers to apply CLIE versus ST-SpAttn appears empirically driven rather than theoretically justified. A principled framework for determining optimal layer assignments would strengthen the contribution.
- While the authors briefly mention audio-visual experiments in the appendix (AVSD dataset), these results deserve fuller integration into the main ev

Reviewer 03·Rating 4·Confidence 4

Strengths

The paper is clearly written and easy to follow. It introduces a genuinely plug-and-play KV-cache framework that remains compatible with high-performance attention backends (e.g., FlashAttention). A lightweight cross-layer importance estimator—reusing shallow-layer attention and weighting deep-layer V by its L2 norm—efficiently ranks tokens, preserving speed while selecting the most salient cache entries.
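The estimator described in this review can be sketched roughly as follows. This is an illustrative reading of the reviewer's one-sentence summary, not the paper's actual formula. The multiplicative combination, the column-sum aggregation of shallow-layer attention, and the names `cross_layer_scores` and `select_cache_entries` are assumptions.

```python
import math

def l2_norm(vec):
    # Euclidean norm of one value vector.
    return math.sqrt(sum(x * x for x in vec))

def cross_layer_scores(shallow_attn, deep_values):
    # Assumed rule: (attention a token receives in a shallow layer)
    # multiplied by (L2 norm of that token's value vector in a deep layer).
    n = len(deep_values)
    received = [sum(row[j] for row in shallow_attn) for j in range(n)]
    return [received[j] * l2_norm(deep_values[j]) for j in range(n)]

def select_cache_entries(scores, budget):
    # Keep the `budget` highest-scoring token positions, in original order.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])

# Toy inputs: a 4-token causal attention matrix from one shallow layer,
# and the value vectors of the same 4 tokens at a deeper layer.
shallow_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.0],
    [0.1, 0.1, 0.3, 0.5],
]
deep_values = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 2.0]]

scores = cross_layer_scores(shallow_attn, deep_values)
kept = select_cache_entries(scores, budget=2)
```

In this toy input, token 3 receives less shallow-layer attention than token 2 but is kept ahead of it because its deep-layer value norm is larger. Under this reading, only the shallow layer needs an explicit attention matrix, so deeper layers could in principle run a fused kernel.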

Weaknesses

The proposed method requires obtaining attention scores for KV-cache importance estimation, which appears incompatible with FlashAttention, as the latter does not explicitly compute or expose attention matrices.
