Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Qiong Wu, Wenhao Lin, Yiyi Zhou, Weihao Ye, Zhanpeng Zen, Xiaoshuai Sun, Rongrong Ji

TL;DR
This paper investigates the redundancy in visual tokens in multimodal large language models (MLLMs), revealing that visual tokens become unnecessary after certain inference stages, and proposes a dynamic token exit method to improve efficiency.
Contribution
The paper introduces DyVTE, a dynamic visual-token exit method that reduces visual token redundancy in MLLMs by using lightweight hyper-networks to decide token removal during inference.
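To make the exit idea concrete, here is a minimal sketch of what a per-layer exit decision could look like. It is not the authors' implementation. The gate (`exit_gate`), the mean pooling of text tokens, the 0.5 threshold, and the choice to drop all visual tokens at once are assumptions made for illustration, and the transformer layer itself is left as a placeholder.

```python
import math


def mean_pool(tokens):
    """Average a list of equal-length token vectors into one vector."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def exit_gate(text_state, weights, bias):
    """Hypothetical stand-in for a lightweight hyper-network:
    a single logistic unit over the pooled text-token state."""
    z = sum(w * x for w, x in zip(weights, text_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def run_with_visual_exit(num_layers, visual_tokens, text_tokens,
                         gate_params, threshold=0.5):
    """Walk through the layers and record how many tokens each one processes.

    After each layer, the gate inspects the text tokens. Once its score
    exceeds `threshold`, all visual tokens are dropped for later layers.
    Returns (sequence length per layer, index of the exit layer or None).
    """
    lengths = []
    exit_layer = None
    for layer in range(num_layers):
        lengths.append(len(visual_tokens) + len(text_tokens))
        # Placeholder: a real model would apply attention and an MLP here.
        if visual_tokens:
            weights, bias = gate_params[layer]
            if exit_gate(mean_pool(text_tokens), weights, bias) > threshold:
                visual_tokens = []  # visual tokens exit from the next layer on
                exit_layer = layer
    return lengths, exit_layer
```

With 576 visual tokens, 32 text tokens, and a gate that first fires at layer 3 of 6, the last two layers in this sketch process 32 tokens instead of 608, which is where the efficiency gain would come from.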
Findings
DyVTE significantly improves MLLMs' efficiency.
Visual tokens become redundant after initial inference stages.
The method enhances understanding of MLLMs' inference patterns.
Abstract
The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to…
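One simple way to probe the attention behavior described in the abstract is to measure, per layer, how much attention mass the text-token queries place on visual-token keys. The sketch below is an illustrative metric, not necessarily the one used in the paper. The function name and the index lists are hypothetical, and `attn` is assumed to be a single head's row-normalized attention matrix.

```python
def visual_attention_share(attn, visual_idx, text_idx):
    """Mean fraction of attention that text queries assign to visual keys.

    attn[q][k] is the attention weight from query position q to key
    position k; each row is assumed to sum to 1.
    """
    shares = [sum(attn[q][k] for k in visual_idx) for q in text_idx]
    return sum(shares) / len(shares)


# Toy example: positions 0-1 are visual tokens, positions 2-3 are text tokens.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.1, 0.2, 0.7, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
share = visual_attention_share(attn, visual_idx=[0, 1], text_idx=[2, 3])
```

Tracking this value layer by layer would show where text tokens stop drawing on image information. A share that falls toward zero and stays there is the kind of signal that would motivate letting visual tokens exit.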
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
