Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

Shiyu Zhao; Zhenting Wang; Felix Juefei-Xu; Xide Xia; Miao Liu; Xiaofang Wang; Mingfu Liang; Ning Zhang; Dimitris N. Metaxas; Licheng Yu

arXiv:2412.00556·cs.CV·December 10, 2024

Open Access

TL;DR

This paper introduces methods for reducing the number of vision tokens in Multimodal Large Language Models (MLLMs), yielding over 2x speedup on models such as LLaVA and InternVL2 while maintaining performance, and improving performance under a fixed computational budget.

Contribution

The paper proposes a greedy search algorithm (G-Search) and a parametric sigmoid function to optimize vision token reduction, improving both the efficiency and the effectiveness of MLLMs.
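As a rough illustration of what a parametric sigmoid schedule could look like, the sketch below maps each layer index to a fraction of vision tokens to keep. The functional form, the parameter names `k` and `x0`, and the numbers used are assumptions made for illustration; this summary does not give the paper's actual parameterization.

```python
import math


def sigmoid_keep_rates(num_layers, k, x0):
    """Hypothetical schedule: fraction of vision tokens kept at each layer.

    k (steepness) and x0 (midpoint layer) are illustrative parameters,
    not names or values taken from the paper.
    """
    rates = []
    for layer in range(num_layers):
        # Decreasing sigmoid: near 1 in shallow layers, near 0 in deep layers.
        rates.append(1.0 / (1.0 + math.exp(k * (layer - x0))))
    return rates


def tokens_kept(num_vision_tokens, rates):
    # Round up so any layer with a nonzero rate keeps at least one token.
    return [math.ceil(num_vision_tokens * r) for r in rates]


# Example values only: 32 layers and 576 vision tokens are assumptions.
rates = sigmoid_keep_rates(num_layers=32, k=0.5, x0=16)
counts = tokens_kept(576, rates)
```

With only two parameters describing the whole schedule, a search over `k` and `x0` is much smaller than a search over one keep count per layer, which is one plausible reason to prefer a parametric form when the budget is fixed.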

Findings

01. Achieves over 2x speedup in models like LLaVA and InternVL2.

02. Outperforms existing token reduction methods under limited computational budgets.

03. Maintains performance levels despite significant token reduction.

Abstract

Prevailing Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone, much as Large Language Models (LLMs) process text tokens. However, the number of vision tokens grows quadratically with image resolution, leading to huge computational costs. In this paper, we consider improving MLLM efficiency in two scenarios: (I) reducing computational cost without degrading performance, and (II) improving performance under a given budget. We start with our main finding that the ranking of vision tokens sorted by attention scores is similar in every layer except the first. Based on this, we assume that the number of essential top vision tokens does not increase along layers. Accordingly, for Scenario I, we propose a greedy search algorithm (G-Search) to find the least number of vision tokens to keep at…
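The abstract is cut off before it describes G-Search in full, so the following is only a minimal sketch of a layer-by-layer greedy search under the stated assumption that the number of kept tokens never increases with depth. The function names, the `evaluate` callback, the choice to leave the first layer untouched, and the toy scoring rule are all hypothetical and are not taken from the paper.

```python
def greedy_keep_counts(num_layers, num_tokens, evaluate, tol=0.0):
    """Greedily search a non-increasing per-layer keep schedule.

    evaluate(counts) -> score is a hypothetical callback standing in for
    running the model with counts[i] vision tokens kept at layer i.
    """
    counts = [num_tokens] * num_layers
    baseline = evaluate(counts)  # score with no tokens removed
    # Assumption: layer 0 is left untouched; search starts at layer 1.
    for layer in range(1, num_layers):
        candidate = counts[layer] - 1
        while candidate >= 0:
            # Apply the candidate to this layer and every deeper layer,
            # which keeps the schedule non-increasing by construction.
            trial = counts[:layer] + [candidate] * (num_layers - layer)
            if evaluate(trial) < baseline - tol:
                break  # score dropped: keep the last accepted count
            counts = trial
            candidate -= 1
    return counts


def toy_evaluate(counts, need=(8, 8, 4, 2)):
    # Toy stand-in for a benchmark: full score only if every layer
    # keeps at least its made-up minimum number of tokens.
    return 1.0 if all(c >= n for c, n in zip(counts, need)) else 0.0


schedule = greedy_keep_counts(num_layers=4, num_tokens=8, evaluate=toy_evaluate)
```

In a real setting each call to `evaluate` would be a full validation run, so stepping by one token as done here would be far too slow; a coarser step or a search over keep ratios would be the practical choice.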

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Softmax · Attention Is All You Need