Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Zhihang Lin; Mingbao Lin; Luxi Lin; Rongrong Ji

arXiv:2405.05803 · cs.CV · January 28, 2025

TL;DR

This paper introduces Visual Tokens Withdrawal (VTW), a plug-and-play method that reduces computational costs in multimodal large language models by strategically removing visual tokens at an optimal layer, without sacrificing performance.

Contribution

The paper proposes a novel VTW technique that leverages attention sink and information migration phenomena to enable rapid inference in MLLMs by withdrawing visual tokens at a specific layer.

Findings

01 Reduces computational overhead by over 40% across tasks

02 Maintains model performance despite token withdrawal

03 Identifies optimal withdrawal layer using KL divergence
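
The layer selection in finding 03 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the criterion is a threshold on the KL divergence between the output distribution of the full model and that of the model with visual tokens withdrawn at a candidate layer, and that the earliest qualifying layer is chosen. The function names, the threshold value, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def pick_withdrawal_layer(full_probs, probs_by_layer, threshold):
    """Return the earliest candidate layer whose output stays close to the full model.

    full_probs:     output distribution with all visual tokens kept (assumed given).
    probs_by_layer: {layer index: output distribution when visual tokens are
                     withdrawn at that layer} (assumed measured on a small dataset).
    threshold:      hypothetical KL tolerance; not a value from the paper.
    """
    for layer in sorted(probs_by_layer):
        if kl_divergence(full_probs, probs_by_layer[layer]) < threshold:
            return layer
    return None  # no candidate layer met the criterion


# Toy numbers for illustration only.
full = [0.7, 0.2, 0.1]
candidates = {
    4: [0.2, 0.5, 0.3],       # withdrawing too early changes the output a lot
    8: [0.5, 0.3, 0.2],
    16: [0.69, 0.21, 0.10],   # nearly identical to the full model
    24: [0.7, 0.2, 0.1],
}
chosen = pick_withdrawal_layer(full, candidates, threshold=0.01)
print(chosen)
```

Choosing the earliest qualifying layer reflects the trade-off in the findings: withdrawing earlier saves more computation, while the KL check guards against changing the model's output.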

Abstract

Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically…
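
The core idea in the abstract, that vision tokens are not needed in deep layers, can be sketched with a toy layer stack. This is an illustration under stated assumptions rather than the repository's implementation: tokens are plain numbers, `layer_fn` is a stand-in for a transformer layer, and the token counts, layer count, and withdrawal layer are made up. Details such as position handling and key-value caching are not modeled.

```python
def withdraw_visual_tokens(tokens, kinds):
    """Drop every position marked 'visual'; keep the rest in order."""
    kept = [(t, k) for t, k in zip(tokens, kinds) if k != "visual"]
    return [t for t, _ in kept], [k for _, k in kept]


def run_layers(tokens, kinds, num_layers, withdraw_at, layer_fn):
    """Run a toy layer stack, withdrawing visual tokens before layer `withdraw_at`.

    Returns the final tokens and the sequence length each layer processed.
    """
    lengths = []
    for layer in range(num_layers):
        if layer == withdraw_at:
            tokens, kinds = withdraw_visual_tokens(tokens, kinds)
        lengths.append(len(tokens))
        tokens = layer_fn(tokens)
    return tokens, lengths


# Hypothetical setup: 2 text tokens, 8 visual tokens, 2 text tokens, 6 layers.
tokens = list(range(12))
kinds = ["text"] * 2 + ["visual"] * 8 + ["text"] * 2
add_one = lambda xs: [x + 1 for x in xs]  # placeholder for a real layer

out, lengths = run_layers(tokens, kinds, num_layers=6, withdraw_at=3, layer_fn=add_one)
_, full_lengths = run_layers(tokens, kinds, num_layers=6, withdraw_at=None, layer_fn=add_one)

print(lengths)                          # sequence length seen by each layer
print(sum(lengths), sum(full_lengths))  # token-layer work with and without withdrawal
```

In this toy run the layers after the withdrawal point process only the text tokens, which is where the savings come from; the ratio here depends entirely on the made-up counts and says nothing about the reported figures.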

Code & Models

Repositories

lzhxmu/vtw (PyTorch, official)

Videos

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training