SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu

TL;DR
This paper discovers that only a small subset of attention heads in multimodal large language models are responsible for visual understanding, and introduces SparseMM, a method to accelerate inference by leveraging this sparsity.
Contribution
The paper reveals the sparsity phenomenon in visual heads of MLLMs and proposes SparseMM, a KV-Cache optimization strategy that improves inference efficiency while preserving accuracy.
Findings
Visual heads, the attention heads that actively contribute to visual understanding, constitute less than 5% of attention heads in MLLMs.
SparseMM achieves 1.38x real-time acceleration during generation.
Memory usage is reduced by 52% without performance loss.
Abstract
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding; we term these visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that…
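The idea of allocating asymmetric budgets by visual score can be illustrated with a minimal sketch. This summary does not spell out SparseMM's actual allocation rule, so the code below is an assumption: it splits a total KV-cache token budget across heads in proportion to their visual scores, with a uniform per-head floor. The function name `allocate_kv_budgets` and the parameters `total_budget` and `min_per_head` are hypothetical and chosen only for illustration.

```python
import numpy as np


def allocate_kv_budgets(visual_scores, total_budget, min_per_head=0):
    """Split a total KV-cache token budget across attention heads.

    Assumed rule (not taken from the paper): every head receives
    `min_per_head` tokens, and the remaining tokens are shared in
    proportion to each head's non-negative visual score.
    """
    scores = np.asarray(visual_scores, dtype=np.float64)
    if scores.ndim != 1 or scores.size == 0:
        raise ValueError("visual_scores must be a non-empty 1-D sequence")
    if np.any(scores < 0):
        raise ValueError("visual_scores must be non-negative")

    n_heads = scores.size
    remaining = total_budget - min_per_head * n_heads
    if remaining < 0:
        raise ValueError("total_budget is too small for the per-head floor")

    total_score = scores.sum()
    if total_score == 0:
        # No head stands out, so fall back to a uniform split.
        shares = np.full(n_heads, remaining / n_heads)
    else:
        shares = remaining * scores / total_score

    # Round down, then hand leftover tokens to the largest fractional parts
    # so the budgets sum exactly to total_budget.
    budgets = np.floor(shares).astype(np.int64)
    leftover = int(remaining - budgets.sum())
    order = np.argsort(-(shares - budgets), kind="stable")
    budgets[order[:leftover]] += 1
    return budgets + min_per_head


if __name__ == "__main__":
    # Hypothetical scores for four heads; head 0 is the most "visual".
    budgets = allocate_kv_budgets([6, 2, 1, 1], total_budget=100, min_per_head=10)
    print(budgets.tolist())  # [46, 22, 16, 16]
```

Under this assumed rule, the few high-scoring heads keep most of the cache while the remaining heads keep only the floor, which is where the memory savings would come from if visual heads are as sparse as reported.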
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need
