MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs
Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko

TL;DR
MI-Pruner prunes visual tokens in multimodal large language models (MLLMs) by scoring them with crossmodal mutual information, improving inference efficiency without architectural changes.
Contribution
The paper scores visual tokens by directly measuring mutual information between visual and textual features, before the two modalities interact, and reports outperforming attention-based pruning methods in both efficiency and effectiveness.
Findings
Outperforms previous attention-based pruning methods
Requires no internal attention maps or architectural modifications
Demonstrates minimal latency increase
Abstract
For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning has emerged to make inference more efficient. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature level. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms…
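To make the idea of scoring visual tokens by crossmodal MI concrete, here is a minimal NumPy sketch. It is not the paper's estimator, which the truncated abstract does not specify. It assumes a joint distribution over (visual token, text token) pairs defined by a softmax over cosine similarities, and scores each visual token by its share of the resulting discrete MI. The function names `mi_token_scores` and `prune_visual_tokens`, the `temperature` and `keep_ratio` parameters, and the feature shapes are all hypothetical.

```python
import numpy as np

def mi_token_scores(visual, text, temperature=1.0):
    """Per-visual-token share of I(V;T) under an ASSUMED softmax joint.

    visual: (n_visual, d) features, text: (n_text, d) features, both taken
    before the two modalities interact. The softmax joint is an illustrative
    assumption, not the estimator used in the paper.
    """
    # Cosine similarity between every visual token and every text token.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature           # shape (n_visual, n_text)

    # Joint distribution p(i, j) over all (visual, text) pairs.
    joint = np.exp(logits - logits.max())
    joint /= joint.sum()
    p_v = joint.sum(axis=1, keepdims=True)     # marginal over visual tokens
    p_t = joint.sum(axis=0, keepdims=True)     # marginal over text tokens

    # I(V;T) = sum_ij p(i,j) log(p(i,j) / (p(i) p(j))).
    # Summing over j only gives the contribution of visual token i.
    return (joint * np.log(joint / (p_v * p_t))).sum(axis=1)

def prune_visual_tokens(visual, text, keep_ratio=0.25):
    """Keep the visual tokens with the highest MI share (hypothetical helper)."""
    scores = mi_token_scores(visual, text)
    k = max(1, int(round(keep_ratio * visual.shape[0])))
    keep = np.sort(np.argsort(scores)[-k:])    # top-k, back in original order
    return visual[keep], keep
```

Under these assumptions the scoring needs only the two feature matrices, which mirrors the abstract's point that no attention maps or model internals are required. A call such as `prune_visual_tokens(visual, text, keep_ratio=0.25)` returns the retained features together with their original indices.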
