SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Jiahui Wang; Zuyan Liu; Yongming Rao; Jiwen Lu

arXiv:2506.05344 · cs.CV · July 8, 2025

TL;DR

This paper discovers that only a small subset of attention heads in multimodal large language models are responsible for visual understanding, and introduces SparseMM, a method to accelerate inference by leveraging this sparsity.

Contribution

The paper reveals the sparsity phenomenon in visual heads of MLLMs and proposes SparseMM, a KV-Cache optimization strategy that improves inference efficiency while preserving accuracy.

Findings

01

Visual heads make up less than 5% of attention heads in MLLMs.

02

SparseMM achieves 1.38x real-time acceleration during generation.

03

Memory usage is reduced by 52% without performance loss.

Abstract

Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than about 5%) of attention heads in LLMs actively contribute to visual understanding; we term these visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate MLLM inference. Compared with prior KV-Cache acceleration methods that…
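The abstract describes giving each head an asymmetric KV-Cache budget based on its visual score, but this excerpt does not state the exact allocation rule. The sketch below is one plausible reading, not the authors' implementation: every head receives a uniform floor, and the remaining budget is shared in proportion to the scores. The function name `allocate_head_budgets`, the `min_per_head` floor, and the proportional rule are all assumptions made for illustration.

```python
from fractions import Fraction
from math import floor
from typing import List


def allocate_head_budgets(
    scores: List[float], total_budget: int, min_per_head: int
) -> List[int]:
    """Split a total KV-cache token budget across attention heads.

    Hypothetical rule (an assumption, not taken from the paper):
    each head gets `min_per_head` tokens, and the rest of the budget
    is divided in proportion to the head's visual score.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if min_per_head * n > total_budget:
        raise ValueError("total_budget is too small for the per-head floor")

    remaining = total_budget - min_per_head * n
    # Fractions keep the proportional split exact, so rounding is predictable.
    exact = [Fraction(s) for s in scores]
    score_sum = sum(exact)
    if score_sum == 0:
        # No head stands out: fall back to an even split.
        shares = [Fraction(remaining, n)] * n
    else:
        shares = [remaining * s / score_sum for s in exact]

    budgets = [min_per_head + floor(share) for share in shares]

    # Rounding down leaves fewer than n tokens unassigned; hand them to
    # the heads with the largest fractional remainders.
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(
        range(n), key=lambda i: shares[i] - floor(shares[i]), reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Four heads with made-up scores; the last head has no visual relevance.
    print(allocate_head_budgets([0.5, 0.25, 0.25, 0.0], 1000, 100))
```

With these made-up scores, the head with zero visual relevance keeps only the floor while the highest-scoring head receives the largest share, which is the kind of asymmetry the abstract refers to. The floor is a design choice of this sketch so that no head loses its cache entirely.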

Code & Models

Repositories

cr400af-a/sparsemm
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need