Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

Ting Liu; Liangtao Shi; Richang Hong; Yue Hu; Quanjun Yin; Linfeng Zhang

arXiv:2411.10803 · cs.CV · November 19, 2024

TL;DR

MustDrop is a multi-stage token dropping method for multimodal large language models that improves inference efficiency by accurately identifying and removing redundant vision tokens across encoding, prefilling, and decoding stages.

Contribution

It introduces a comprehensive multi-stage token importance measurement and filtering strategy that enhances efficiency without sacrificing accuracy in multimodal LLMs.

Findings

01 Reduces FLOPs by about 88.5% on LLaVA.

02 Achieves a vision token compression ratio of 92.2%.

03 Maintains accuracy comparable to full-token models.

Abstract

The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and take up most of the input tokens, which harms their inference efficiency. To solve this problem, some recent works were introduced to drop the unimportant tokens during inference where the importance of each token is decided only by the information in either the vision encoding stage or the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token from the whole lifecycle, including the vision encoding stage, prefilling stage, and decoding stage. Concretely, in the visual encoding stage, MustDrop merges spatially adjacent tokens with high similarity, and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage,…
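The encoding-stage idea described in the abstract (merge neighboring tokens that are highly similar, and mark a key token set that later stages must not discard) can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-D neighborhood, the cosine threshold, the running-mean merge rule, and the per-token importance `scores` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_adjacent(tokens, threshold=0.9):
    """Merge runs of neighboring tokens (assumed 1-D order) whose cosine
    similarity to the running group mean is at least `threshold`.
    Returns a list of (mean_vector, member_indices) pairs."""
    groups = []
    for i, tok in enumerate(tokens):
        if groups and cosine(groups[-1][0], tok) >= threshold:
            mean, idx = groups[-1]
            n = len(idx)
            # Update the group mean incrementally with the new member.
            new_mean = [(m * n + t) / (n + 1) for m, t in zip(mean, tok)]
            groups[-1] = (new_mean, idx + [i])
        else:
            groups.append((list(tok), [i]))
    return groups

def key_token_set(scores, k):
    """Indices of the k highest-scoring tokens (scores are hypothetical)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

# Toy input: five 2-D "vision tokens" and made-up importance scores.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 1.0]]
scores = [0.1, 0.05, 0.6, 0.2, 0.05]

groups = merge_adjacent(tokens, threshold=0.95)
keys = key_token_set(scores, k=1)
# A merged group is protected from later dropping if it holds a key token.
protected = [any(i in keys for i in idx) for _, idx in groups]
```

In this toy run the five tokens collapse into three groups, and only the group containing the highest-scoring token is flagged as protected. A real vision encoder would use a 2-D patch grid and model-derived importance signals rather than the hand-written values used here.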

Code & Models

Repositories

liuting20/mustdrop (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training