TL;DR
PureKV is a plug-and-play framework that combines KV cache compression with sparse attention in vision-language models. On high-resolution inputs it reports 5.0x KV cache compression and 3.16x prefill acceleration with negligible quality degradation.
Contribution
The paper presents a KV cache compression strategy that stays compatible with efficient attention mechanisms such as FlashAttention, combined with Spatial-Temporal Sparse Attention (ST-SpAttn) tailored for efficient video VLLMs.
Findings
Achieves 5.0x KV cache compression
Attains 3.16x prefill acceleration
Maintains negligible quality degradation
Abstract
Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity in attention and autoregressive generation, as well as the constantly growing key value (KV) cache size, severely hinder the prefilling and decoding stages. Recent efforts have attempted to compress KV cache by identifying and pruning KV cache of less important tokens, but these methods typically rely on attention scores to estimate token importance, making them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention, which do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache, thereby compromising the effectiveness of downstream KV cache compression strategies. To address…
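The compatibility problem described above can be made concrete with a minimal NumPy sketch of generic attention-score-based pruning. It is not PureKV and not any specific published method; the function names `attention_score_importance` and `prune_kv` and the single-head, causal setup are assumptions made for illustration. The point is that the importance score is read off the full attention matrix, which fused kernels such as FlashAttention do not explicitly compute.

```python
import numpy as np

def attention_score_importance(q, k):
    """Hypothetical helper: importance of each key = total attention it receives.

    q, k: arrays of shape (seq_len, head_dim) for a single head.
    The full (seq_len, seq_len) attention matrix is materialized here,
    which is exactly what fused attention kernels avoid.
    """
    seq_len, head_dim = q.shape
    logits = q @ k.T / np.sqrt(head_dim)
    # Causal mask: query i may only attend to keys at positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    # Numerically stable row-wise softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Column sums: attention mass received by each key token.
    return weights.sum(axis=0)

def prune_kv(k, v, importance, budget):
    """Hypothetical helper: keep the `budget` highest-scoring tokens, in order."""
    keep = np.sort(np.argsort(importance)[-budget:])
    return k[keep], v[keep], keep

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
importance = attention_score_importance(q, k)
k_kept, v_kept, kept_idx = prune_kv(k, v, importance, budget=4)
```

Because every row of the softmax sums to 1, the importance scores sum to the sequence length; the ranking depends entirely on having the explicit `weights` matrix in memory.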
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The cross-layer importance estimation using lower-layer attention scores and value vector norms is both novel and empirically validated.
2. The introduction of ST-SpAttn for video tasks is well-motivated and demonstrates clear benefits in both cache purification and acceleration.
3. The experiments cover multiple VLLMs, tasks, and cache budgets, showing consistent improvements.
1. Figure 1(c) could be further improved to more clearly illustrate the differences between dense attention and ST-SpAttn.
2. While the empirical results are strong, the theoretical justification for why lower-layer attention scores are sufficient for high-layer importance estimation could be further strengthened, for example by providing more formal analysis or theoretical bounds.
3. It would be beneficial to include an analysis of PureKV’s computational overhead in the ablation study.
4. H2O
- The idea makes sense, combining two modules together for video KV cache compression with theoretically grounded cross-layer correlation analysis.
- The results are convincing, although more diverse video understanding tasks would strengthen the evaluation.
- The method achieves genuine compatibility with modern attention accelerators.
- The core techniques (attention-based importance scoring and sparse attention) are not novel individually; more critically, the selection of which layers to apply CLIE versus ST-SpAttn appears empirically driven rather than theoretically justified. A principled framework for determining optimal layer assignments would strengthen the contribution.
- While the authors briefly mention audio-visual experiments in the appendix (AVSD dataset), these results deserve fuller integration into the main ev
The paper is clearly written and easy to follow. It introduces a genuinely plug-and-play KV-cache framework that remains compatible with high-performance attention backends (e.g., FlashAttention). A lightweight cross-layer importance estimator—reusing shallow-layer attention and weighting deep-layer V by its L2 norm—efficiently ranks tokens, preserving speed while selecting the most salient cache entries.
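The estimator this review describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the review only says that shallow-layer attention is reused and deep-layer V is weighted by its L2 norm, so the elementwise product used here, the per-token score vector `shallow_scores`, and the function names are all hypothetical.

```python
import numpy as np

def cross_layer_importance(shallow_scores, deep_values):
    """Hypothetical combination: shallow-layer attention mass times deep-layer ||V||_2.

    shallow_scores: (seq_len,) per-token attention mass taken from a lower layer.
    deep_values:    (seq_len, head_dim) value vectors of the deeper layer.
    """
    value_norms = np.linalg.norm(deep_values, axis=-1)
    return shallow_scores * value_norms

def select_tokens(importance, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    return np.sort(np.argsort(importance)[-budget:])

# Toy inputs, chosen by hand for illustration only.
shallow_scores = np.array([0.5, 0.1, 0.3, 0.1])
deep_values = np.array([[1.0, 0.0], [0.0, 10.0], [3.0, 4.0], [0.0, 0.0]])

importance = cross_layer_importance(shallow_scores, deep_values)
kept = select_tokens(importance, budget=2)  # keeps tokens 1 and 2
```

In this toy case token 0 receives the most shallow-layer attention, yet tokens 1 and 2 are kept because their value vectors have larger norms. Under this sketch, the deeper layer needs no attention matrix of its own, only its value vectors.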
The proposed method requires obtaining attention scores for KV-cache importance estimation, which appears incompatible with FlashAttention, as the latter does not explicitly compute or expose attention matrices.
