Sparsity Meets Similarity: Leveraging Long-Tail Distribution for Dynamic Optimized Token Representation in Multimodal Large Language Models
Gaotong Yu, Yi Chen, Jian Xu

TL;DR
This paper introduces a dynamic token pruning method for multimodal large language models that leverages the long-tail distribution of visual token similarities to significantly reduce computational costs while maintaining performance.
Contribution
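The paper proposes a dynamic pruning algorithm that works in two stages: it first trims visual tokens according to the distribution of their similarity to the CLS token, then prunes visual tokens with low correlation to the text inside the LLM layer. Both stages shorten the input sequence the LLM must process.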
Findings
Achieves comparable performance with only 22% of original tokens
Effectively reduces computational costs in multimodal LLMs
Demonstrates the importance of long-tail distribution in token similarity
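To give a feel for what keeping 22% of the visual tokens means for sequence length, here is a rough back-of-the-envelope sketch. The token counts (576 visual, 64 text) and the function names are illustrative assumptions, not figures from the paper, and the quadratic term covers only the pairwise attention-score computation, so it is a first-order estimate rather than a measured speedup.

```python
def pruned_sequence_length(num_visual: int, num_text: int, keep_ratio: float = 0.22) -> int:
    """Sequence length after keeping `keep_ratio` of the visual tokens.

    Text tokens are assumed to be left untouched (an assumption of this sketch).
    """
    kept_visual = round(num_visual * keep_ratio)
    return kept_visual + num_text


def attention_cost_ratio(num_visual: int, num_text: int, keep_ratio: float = 0.22) -> float:
    """Ratio of pairwise attention-score cost after pruning vs. before.

    Attention scores are computed for every pair of tokens, so this term
    grows with the square of the sequence length.
    """
    full_length = num_visual + num_text
    pruned_length = pruned_sequence_length(num_visual, num_text, keep_ratio)
    return (pruned_length / full_length) ** 2


# Hypothetical example: 576 visual tokens and a 64-token text prompt.
print(pruned_sequence_length(576, 64))
print(f"{attention_cost_ratio(576, 64):.3f}")
```

Under these assumed counts the sequence shrinks from 640 to 191 tokens. Layers whose cost is linear in sequence length would shrink by the length ratio rather than its square.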
Abstract
Recently, multimodal large language models (MM-LLMs) have achieved significant success in various tasks, but their high computational costs limit widespread application. The main computational burden arises from processing concatenated text and visual tokens in the LLM layer, where input token length directly affects efficiency. Our analysis of visual tokens reveals that their similarity to the CLS token follows a long-tail distribution, with only a few showing high similarity. Building on this observation, we propose a dynamic pruning algorithm that identifies the inflection point in the visual CLS token similarity curve, enabling effective trimming of visual tokens to accelerate inference. Additionally, we perform a second round of pruning in the LLM layer, filtering out low-correlation tokens through the interaction between visual and textual features. Experimental results demonstrate…
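The first pruning stage described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the inflection point is located, so the sketch assumes a common knee heuristic (the point on the sorted similarity curve that lies farthest below the straight line joining its first and last values). All function names are hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import numpy as np


def cls_similarity(cls_token: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between a CLS embedding (d,) and visual tokens (n, d).

    Assumes no embedding has zero norm.
    """
    cls_unit = cls_token / np.linalg.norm(cls_token)
    token_norms = np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    return (visual_tokens / token_norms) @ cls_unit


def find_keep_count(sorted_desc: np.ndarray) -> int:
    """Number of tokens that precede the knee of a descending similarity curve.

    Assumed heuristic: normalize the curve to the unit square and pick the
    point with the largest gap below the chord from (0, 1) to (1, 0).
    """
    n = len(sorted_desc)
    span = sorted_desc[0] - sorted_desc[-1] if n else 0.0
    if n < 3 or span <= 0:
        return n  # too short or flat: no knee to find, keep everything
    x = np.linspace(0.0, 1.0, n)
    y = (sorted_desc - sorted_desc[-1]) / span
    gap_below_chord = (1.0 - x) - y
    return max(1, int(np.argmax(gap_below_chord)))


def prune_by_cls_similarity(cls_token: np.ndarray, visual_tokens: np.ndarray) -> np.ndarray:
    """Return indices of the visual tokens to keep, in their original order."""
    sims = cls_similarity(cls_token, visual_tokens)
    order = np.argsort(-sims, kind="stable")  # most similar first
    keep_count = find_keep_count(sims[order])
    return np.sort(order[:keep_count])


# Toy example: 3 tokens aligned with the CLS direction, 17 nearly orthogonal.
cls = np.array([1.0, 0.0, 0.0, 0.0])
tokens = np.array([[0.05, 1.0, 0.01 * j, 0.0] for j in range(20)])
for i, pos in enumerate([2, 7, 11]):
    tokens[pos] = [1.0, 0.1 * i, 0.0, 0.0]
print(prune_by_cls_similarity(cls, tokens))
```

Because the cut-off comes from the shape of each image's similarity curve rather than a fixed budget, the number of kept tokens varies per input, which is what makes the pruning dynamic. The second stage, which filters tokens inside the LLM layer using visual-text interaction, is not covered by this sketch.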
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Geographic Information Systems Studies
Methods: Pruning
