What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou

TL;DR
This paper introduces G-Prune, a training-free graph-based method for visual token pruning in multimodal large language models, effectively reducing computation while maintaining high task performance.
Contribution
It proposes a novel graph-based visual token pruning technique that selectively retains important tokens without additional training, improving efficiency in MLLMs.
Findings
G-Prune reduces the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA.
Accuracy drops by only 0.95% and 2.34% on these two benchmarks, respectively.
Effective for both coarse- and fine-grained tasks.
Abstract
Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be foreground or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive…
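The pipeline described in the abstract (tokens as nodes, similarity-weighted links, iterative propagation, keep the top-scoring tokens) can be illustrated with a minimal sketch. The abstract does not give the exact propagation rule, so everything specific here is an assumption made for illustration: the function name `graph_prune`, cosine similarity as the edge weight, clipping of negative similarities, and a PageRank-style update with a `damping` term. It should not be read as the authors' implementation.

```python
import numpy as np


def graph_prune(tokens: np.ndarray, keep: int, iters: int = 10,
                damping: float = 0.85) -> np.ndarray:
    """Return sorted indices of the `keep` tokens with the highest score.

    Hypothetical helper: the scoring rule is a PageRank-style stand-in,
    not the rule used in the paper.
    """
    n = tokens.shape[0]

    # Nodes are tokens; edge weights are pairwise cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Assumption: keep only non-negative similarities and drop self-loops.
    adj = np.clip(sim, 0.0, None)
    np.fill_diagonal(adj, 0.0)

    # Column-normalise so each node spreads its score over its neighbours.
    # Isolated nodes (zero column sum) spread their score uniformly.
    col_sums = adj.sum(axis=0, keepdims=True)
    trans = np.divide(adj, col_sums,
                      out=np.full_like(adj, 1.0 / n),
                      where=col_sums > 0)

    # Propagate importance along the weighted links for a few iterations.
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        score = damping * (trans @ score) + (1.0 - damping) / n

    # Keep the highest-scoring tokens, restored to their original order.
    top = np.argsort(-score, kind="stable")[:keep]
    return np.sort(top)


# Toy usage with random features (sizes are arbitrary, not from the paper).
rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((32, 16))
kept = graph_prune(visual_tokens, keep=8)
pruned_tokens = visual_tokens[kept]
```

Because scores come from the graph structure alone rather than from a foreground mask, the retained indices can fall anywhere in the image, which matches the abstract's point that kept tokens may be foreground or background. Returning the indices in sorted order preserves the original token sequence for the downstream language model.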
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
