Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji

TL;DR
This paper introduces Visual Tokens Withdrawal (VTW), a plug-and-play method that cuts the inference cost of multimodal large language models (MLLMs) by withdrawing visual tokens at a selected layer while maintaining performance.
Contribution
The paper proposes VTW, a technique that builds on two observed phenomena, attention sink and information migration, to speed up MLLM inference by withdrawing visual tokens at a specific layer.
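The withdrawal step can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the names `withdraw_visual_tokens`, `run_with_withdrawal`, and `toy_layer` are hypothetical, and `toy_layer` is only a stand-in for a real transformer layer. The sketch assumes that withdrawal means dropping every visual position from the sequence before a chosen layer index, so that later layers process text tokens only.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def withdraw_visual_tokens(hidden: Sequence[Vector],
                           is_visual: Sequence[bool]) -> List[Vector]:
    """Keep only the non-visual positions of a token sequence."""
    if len(hidden) != len(is_visual):
        raise ValueError("hidden and is_visual must have the same length")
    return [h for h, v in zip(hidden, is_visual) if not v]


def run_with_withdrawal(layers: Sequence[Layer],
                        hidden: Sequence[Vector],
                        is_visual: Sequence[bool],
                        withdraw_at: int) -> Tuple[List[Vector], List[int]]:
    """Run `layers` in order, dropping visual tokens just before layer
    index `withdraw_at` (an assumed convention for this sketch).

    Returns the final hidden states and the number of tokens that each
    layer processed.
    """
    states = list(hidden)
    tokens_per_layer = []
    for index, layer in enumerate(layers):
        if index == withdraw_at:
            states = withdraw_visual_tokens(states, is_visual)
        tokens_per_layer.append(len(states))
        states = layer(states)
    return states, tokens_per_layer


def toy_layer(states: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a transformer layer: add the sequence
    mean to every token so that tokens exchange information."""
    dim = len(states[0])
    mean = [sum(s[d] for s in states) / len(states) for d in range(dim)]
    return [[s[d] + mean[d] for d in range(dim)] for s in states]


# Five tokens: one text token, three visual tokens, one text token.
demo_hidden = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
demo_is_visual = [False, True, True, True, False]
demo_final, demo_counts = run_with_withdrawal(
    [toy_layer] * 4, demo_hidden, demo_is_visual, withdraw_at=2)
print(demo_counts)  # [5, 5, 2, 2]
```

In this toy run the first two layers see all five tokens and the last two see only the two text tokens, which is where the compute saving would come from.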
Findings
Reduces computational overhead by over 40% across tasks
Maintains model performance despite token withdrawal
Identifies optimal withdrawal layer using KL divergence
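One way the KL-divergence criterion in the last finding might work is sketched below. This is an illustration under stated assumptions, not the paper's procedure: the function names, the threshold value, the candidate distributions, and the rule "pick the earliest layer whose divergence falls below the threshold" are all assumptions made for this example. The `reference` distribution stands for the model's output with all visual tokens kept, and each candidate stands for the output when tokens are withdrawn at that layer.

```python
import math
from typing import Dict, Optional, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def select_withdrawal_layer(reference: Sequence[float],
                            candidates: Dict[int, Sequence[float]],
                            threshold: float) -> Optional[int]:
    """Return the smallest layer index whose output distribution stays
    within `threshold` (in KL divergence) of the reference output, or
    None if no candidate layer qualifies."""
    for layer in sorted(candidates):
        if kl_divergence(reference, candidates[layer]) < threshold:
            return layer
    return None


# Hypothetical numbers, chosen only to exercise the functions.
reference = [0.7, 0.2, 0.1]
candidates = {
    4: [0.3, 0.4, 0.3],      # early withdrawal changes the output a lot
    8: [0.5, 0.3, 0.2],
    16: [0.69, 0.21, 0.10],  # close to the reference
    24: [0.7, 0.2, 0.1],     # identical to the reference
}
print(select_withdrawal_layer(reference, candidates, threshold=0.01))  # 16
```

Choosing the earliest qualifying layer trades off the two goals in the findings: an earlier layer saves more computation, while the threshold limits how far the output may drift from the full model.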
Abstract
Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
