Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models
Xindian Ma, Yidi Lu, Peng Zhang, Jing Zhang

TL;DR
This paper introduces Hierarchical Adaptive Eviction (HAE), a novel cache management framework for Multimodal Large Language Models that reduces memory and computational costs while maintaining high performance in visual and textual tasks.
Contribution
The paper proposes HAE, a new KV cache eviction strategy that optimizes token interaction and reduces resource usage in Multimodal LLMs, with theoretical guarantees and empirical improvements.
Findings
Reduces KV cache memory by 41% with minimal accuracy loss
Speeds up story generation inference by 1.5x
Maintains output quality on vision-language tasks
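To put the 41% figure in perspective, the arithmetic below estimates the size of a dense KV cache and the saving from keeping fewer tokens. It is a rough sketch, not taken from the paper: the model shape, the prompt length, and the assumption that the saving comes from keeping 59% of tokens uniformly in every layer are all hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Size of a dense KV cache: one key and one value vector per token, head, and layer."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Hypothetical model shape and prompt length, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=96)
prompt_tokens = 4096

full = kv_cache_bytes(num_tokens=prompt_tokens, **shape)

# Assumption: a 41% memory reduction is modelled as keeping 59% of tokens in every layer.
kept_tokens = round(prompt_tokens * (1 - 0.41))
reduced = kv_cache_bytes(num_tokens=kept_tokens, **shape)

saving = 1 - reduced / full
print(f"full: {full / 2**30:.2f} GiB, reduced: {reduced / 2**30:.2f} GiB, saving: {saving:.1%}")
```

Because the cache size is linear in the number of retained tokens, the fractional saving does not depend on the assumed model shape, only on the fraction of tokens kept.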
Abstract
The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information…
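The pre-filling idea described in the abstract, deciding once which visual tokens to evict and then reusing those indices in every layer, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the threshold `r`, and the exact rule (evict a visual token only when both its mean first-layer attention and its variance across heads are low) are hypothetical stand-ins for the paper's Dual-Attention Pruning criteria.

```python
import numpy as np

def select_kept_tokens(attn_l1, is_visual, r=0.5):
    """Pick tokens to keep from first-layer attention (hypothetical rule).

    attn_l1:   (heads, queries, keys) attention weights of the first layer.
    is_visual: (keys,) boolean mask marking visual tokens.
    Returns the sorted indices of tokens that stay in the cache.
    """
    received = attn_l1.mean(axis=1)      # (heads, keys): attention each key receives
    mean_score = received.mean(axis=0)   # (keys,): average over heads
    var_score = received.var(axis=0)     # (keys,): disagreement across heads
    vis = np.flatnonzero(is_visual)
    # Assumed criterion: evict visual tokens that are low on both signals.
    low_mean = mean_score[vis] < r * mean_score[vis].mean()
    low_var = var_score[vis] <= r * var_score[vis].mean()
    evict = vis[low_mean & low_var]
    return np.setdiff1d(np.arange(attn_l1.shape[-1]), evict)

def broadcast_eviction(kv_cache, keep):
    """Apply the same kept indices to every layer's keys and values.

    kv_cache: list of (K, V) pairs, each of shape (heads, tokens, head_dim).
    """
    return [(k[:, keep, :], v[:, keep, :]) for k, v in kv_cache]

# Toy usage: 2 heads, 4 queries, 6 keys; tokens 1 to 4 are visual.
row = np.array([0.3, 0.3, 0.01, 0.01, 0.28, 0.1])
attn = np.tile(row, (2, 4, 1))
is_visual = np.array([False, True, True, True, True, False])
keep = select_kept_tokens(attn, is_visual)

kv_cache = [(np.zeros((2, 6, 8)), np.zeros((2, 6, 8))) for _ in range(3)]
pruned = broadcast_eviction(kv_cache, keep)
```

Text tokens are never candidates for eviction in this sketch, and the selection runs once, so the per-layer cost is a single indexing operation.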
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper introduces a hierarchical approach that explicitly addresses the heterogeneous attention patterns between visual and textual tokens in MLLMs. The observation that visual tokens exhibit higher sparsity than text tokens, particularly in early layers, provides empirical justification for developing different methods. The Dual-Attention Pruning mechanism exploits this by computing eviction decisions only in the first layer and broadcasting them across the network, reducing both memory f
1. Theorem 2.1 assumes exponential decay with constant rate $\lambda$ (line 687), which may not reflect actual attention dynamics in transformers. The worst-case analysis provides loose bounds that may not be tight in practice. The proof relies on geometric series summation that assumes independent evictions, but KV eviction decisions are sequentially dependent. The gap between theoretical guarantees and empirical performance is not discussed. 2. Broadcasting first-layer eviction decisions wto
1. Significant reduction in KV cache memory (up to 41%) with negligible accuracy loss for MLLMs. 2. The method is training-free, making it easy to adopt for existing models. 3. The paper is well-structured and the writing is easy to follow.
1. The paper would benefit from a brief explanation of MLLM architecture, using a model such as Phi-3.5 Vision-Instruct as an example, to help readers better understand how these models compare to pure LLMs. 2. In the observations of Section 2.1, the paper lacks clear definitions for *sparsity rate* and *variance*. The figure quality should also be improved. 3. Previous work [VLCache](https://arxiv.org/pdf/2410.23317) provides a comprehensive, layer-wise attention sparsity analysis for MLLMs. The authors appear t
- HAE’s two-stage design (Dual-Attention Pruning for pre-filling, Dynamic Decoding Eviction Strategy for decoding) is logically motivated. The pre-filling stage leverages visual token sparsity in the first layer and broadcasts eviction indices to other layers, reducing redundant computations. The decoding stage uses an OS-inspired "recycling bin" to avoid hasty greedy eviction, balancing efficiency and information retention. - Theoretical analyses (Theorem 2.1 on cache integrity, Corollary 2.
First, the ablation study (Table 3) is relatively shallow—more experiments on how hyperparameters (e.g., threshold r, recycling bin size) affect performance would strengthen robustness. Second, while HAE outperforms training-free baselines, its comparison with trainable methods (e.g., Dynamic-LLaVA) is limited; the gap in MMB (64.0 vs. 65.4) needs more analysis on why trainable methods still have minor advantages. Third, the case demonstration in Appendix A.3.5 lacks qualitative details—clearer
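The "recycling bin" that the reviews and the abstract refer to can be pictured with a small sketch of deferred eviction during decoding. This is one plausible reading, not the paper's algorithm: the class name, the `budget`, `bin_size`, and `restore_threshold` parameters, and the rules for restoring or permanently dropping tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RecycleBinCache:
    budget: int                      # max live tokens (hypothetical, assumed >= 1)
    bin_size: int                    # max tokens parked in the bin (hypothetical)
    restore_threshold: float = 0.5   # attention needed to restore a binned token (hypothetical)
    live: dict = field(default_factory=dict)         # token_id -> cumulative attention
    recycle_bin: dict = field(default_factory=dict)  # soft-evicted tokens, oldest first
    dropped: list = field(default_factory=list)      # permanently evicted token ids

    def step(self, token_id, attn):
        """Process one decoding step.

        attn maps token_id -> attention weight received at this step.
        """
        self.live[token_id] = 0.0
        for tid, w in attn.items():
            if tid in self.live:
                self.live[tid] += w
            elif tid in self.recycle_bin and w >= self.restore_threshold:
                # A binned token that becomes important again is restored.
                self.live[tid] = self.recycle_bin.pop(tid) + w
        # Soft-evict the lowest-scoring tokens, never the token just added.
        while len(self.live) > self.budget:
            victim = min((t for t in self.live if t != token_id), key=self.live.get)
            self.recycle_bin[victim] = self.live.pop(victim)
        # Only when the bin overflows is the oldest entry dropped for good.
        while len(self.recycle_bin) > self.bin_size:
            oldest = next(iter(self.recycle_bin))
            del self.recycle_bin[oldest]
            self.dropped.append(oldest)

# Toy usage with a budget of 2 live tokens and a bin holding 1 token.
cache = RecycleBinCache(budget=2, bin_size=1)
cache.step(0, {})
cache.step(1, {0: 0.9})
cache.step(2, {0: 0.1, 1: 0.05})   # token 1 is soft-evicted into the bin
cache.step(3, {1: 0.8, 2: 0.1})    # token 1 is restored, then re-ranked
```

In this sketch, a greedy policy would have discarded token 1 at the third step; the bin gives it one more chance to be scored before anything is dropped permanently. Both `bin_size` and `restore_threshold` play the role of the hyperparameters whose sensitivity the reviewers ask about.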
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
