Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Jiedong Zhuang; Lu Lu; Ming Dai; Rui Hu; Jian Chen; Qiang Liu; Haoji Hu

arXiv:2602.01901 · cs.CV · February 3, 2026

Open Access

TL;DR

This paper introduces Lazy Attention and Q Cache, which reduce redundant attention computation in multimodal large language models by sharing attention patterns across layers, improving efficiency with minimal performance loss.

Contribution

The paper proposes Lazy Attention and Q Cache, enabling cross-layer attention sharing in MLLMs, reducing computation and cache usage without compromising accuracy.

Findings

01 KV cache usage reduced by over 35%

02 Achieves 1.5x throughput improvement

03 Maintains approximately 99% of model performance

Abstract

Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engender a substantial computational load and a key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse, intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation of the model's attention mechanism from a new perspective, and discern that the attention within more than half of all decode layers is semantically similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their…
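
The layer-similarity observation in the abstract can be illustrated with a small offline profiling sketch. This is not the authors' implementation: the abstract is truncated before it describes the inheritance mechanism, so the cosine-similarity measure, the 0.95 threshold, and the names `attention_weights`, `cosine`, and `lazy_layer_plan` are all assumptions made for illustration. The sketch only marks which layers have attention close enough to an earlier layer that they could, in principle, reuse it.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    return softmax(scores)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def lazy_layer_plan(per_layer_weights, threshold=0.95):
    """Label each layer 'compute' or 'inherit' (hypothetical helper).

    A layer is labeled 'inherit' when its attention weights have cosine
    similarity >= threshold with the most recent 'compute' layer.
    The threshold value is an assumption, not taken from the paper.
    """
    plan = ["compute"]
    anchor = per_layer_weights[0]
    for weights in per_layer_weights[1:]:
        if cosine(anchor, weights) >= threshold:
            plan.append("inherit")
        else:
            plan.append("compute")
            anchor = weights
    return plan


# Toy attention weights for four layers over three visual tokens.
toy_layers = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.21, 0.69],
]
print(lazy_layer_plan(toy_layers))
# → ['compute', 'inherit', 'compute', 'inherit']
```

Because the plan is derived from attention weights that were already computed, a sketch like this is only useful for profiling on sample inputs; the layers marked 'inherit' would then skip their own attention computation at inference time.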

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling