Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Qiong Wu; Wenhao Lin; Yiyi Zhou; Weihao Ye; Zhanpeng Zen; Xiaoshuai Sun; Rongrong Ji

arXiv:2411.19628 · cs.CV · July 28, 2025

TL;DR

This paper investigates visual-token redundancy in multimodal large language models (MLLMs). It shows that visual tokens stop contributing to reasoning once the text tokens have received enough image information, and it proposes a dynamic token-exit method to improve efficiency.

Contribution

The paper introduces DyVTE, a dynamic visual-token exit method that reduces visual token redundancy in MLLMs by using lightweight hyper-networks to decide token removal during inference.
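
The exit idea described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `toy_layer` is a placeholder for a transformer layer, `make_toy_gate` is a hypothetical logistic unit standing in for a lightweight hyper-network, and the weights, bias, and threshold are arbitrary. It only shows the control flow: a small gate inspects the text-token states before each layer and, once it fires, all visual tokens are removed for the remaining layers.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length token vectors."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def toy_layer(tokens: Sequence[Vector]) -> List[Vector]:
    """Placeholder for a transformer layer (assumption): blend each token with the sequence mean."""
    avg = mean_pool(tokens)
    return [[0.5 * x + 0.5 * a for x, a in zip(tok, avg)] for tok in tokens]


def make_toy_gate(weights: Vector, bias: float) -> Callable[[Sequence[Vector]], float]:
    """Hypothetical gate: a logistic unit over mean-pooled text states, returning a score in (0, 1)."""
    def gate(text_tokens: Sequence[Vector]) -> float:
        pooled = mean_pool(text_tokens)
        z = sum(w * x for w, x in zip(weights, pooled)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return gate


def run_with_visual_exit(
    visual: Sequence[Vector],
    text: Sequence[Vector],
    num_layers: int,
    gate: Callable[[Sequence[Vector]], float],
    threshold: float = 0.5,
) -> Tuple[List[Vector], Optional[int], int]:
    """Run the toy layers, dropping all visual tokens once the gate score reaches the threshold.

    Returns the final text states, the layer index at which visual tokens exited
    (None if they never did), and the total number of token-layer evaluations.
    """
    visual = [list(tok) for tok in visual]
    text = [list(tok) for tok in text]
    exit_layer: Optional[int] = None
    tokens_processed = 0
    for layer_idx in range(num_layers):
        # The gate only looks at text states: have they absorbed enough image information?
        if visual and gate(text) >= threshold:
            visual = []
            exit_layer = layer_idx
        n_vis = len(visual)
        seq = toy_layer(visual + text)
        tokens_processed += len(seq)
        visual, text = seq[:n_vis], seq[n_vis:]
    return text, exit_layer, tokens_processed


# Toy inputs: 4 "visual" tokens and 2 "text" tokens of dimension 2.
visual_tokens = [[1.0, 1.0] for _ in range(4)]
text_tokens = [[0.0, 0.0] for _ in range(2)]
gate = make_toy_gate(weights=[4.0, 4.0], bias=-4.4)

_, exit_layer, cost = run_with_visual_exit(visual_tokens, text_tokens, 6, gate)
# A threshold above 1 can never be reached by a sigmoid, so this run never exits.
_, no_exit_layer, baseline_cost = run_with_visual_exit(
    visual_tokens, text_tokens, 6, gate, threshold=1.1
)
print(exit_layer, cost, baseline_cost)
```

In this toy setup the visual tokens are dropped before layer 3, and the token-layer count falls from 36 to 24. These numbers only illustrate the bookkeeping and say nothing about the speedups reported in the paper.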

Findings

01. DyVTE significantly improves MLLMs' efficiency.

02. Visual tokens become redundant after initial inference stages.

03. The method enhances understanding of MLLMs' inference patterns.

Abstract

The excessive use of visual tokens in existing Multimodal Large Language Models (MLLMs) often exhibits obvious redundancy and brings in prohibitively expensive computation. To gain insights into this problem, we first conduct extensive empirical studies on the attention behaviors of MLLMs, and summarize three main inference stages in MLLMs: (i) Early fusion between tokens is first accomplished quickly. (ii) Intra-modality modeling then comes into play. (iii) Multimodal reasoning resumes and lasts until the end of inference. In particular, we reveal that visual tokens will stop contributing to reasoning when the text tokens receive enough image information, yielding obvious visual redundancy. Based on these generalized observations, we propose a simple yet effective method to improve the efficiency of MLLMs, termed dynamic visual-token exit (DyVTE). DyVTE uses lightweight hyper-networks to…

Code & Models

Repositories

doubtedsteam/dyvte
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training