What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

Yutao Jiang; Qiong Wu; Wenhao Lin; Wei Yu; Yiyi Zhou

arXiv:2501.02268·cs.CV·January 7, 2025

TL;DR

This paper introduces G-Prune, a training-free graph-based method for visual token pruning in multimodal large language models, effectively reducing computation while maintaining high task performance.

Contribution

It proposes a novel graph-based visual token pruning technique that selectively retains important tokens without additional training, improving efficiency in MLLMs.

Findings

01

G-Prune reduces the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA.

02

Accuracy drops by only 0.95% on VQA2.0 and 2.34% on TextVQA.

03

Effective for both coarse- and fine-grained tasks.

Abstract

Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be foreground or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive…
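The pipeline the abstract describes (tokens as nodes, similarity-weighted edges, iterative propagation, keep the top-scoring tokens) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact propagation rule, so the code below substitutes a plain random-walk update as a stand-in. The function names `cosine` and `prune_tokens`, the parameters `keep` and `iterations`, and the choice of cosine similarity as the edge weight are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def prune_tokens(tokens, keep, iterations=10):
    """Hypothetical graph-based pruning sketch (not the paper's exact rule).

    tokens: list of feature vectors, one per visual token.
    keep: number of tokens to retain.
    Returns (sorted indices of kept tokens, final importance scores).
    """
    n = len(tokens)
    # Assumption: edge weight = non-negative cosine similarity, no self-loops.
    weights = [
        [max(cosine(tokens[i], tokens[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]

    # Start from uniform importance and propagate it along weighted links.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [0.0] * n
        for i in range(n):
            if out_sums[i] == 0.0:
                # An isolated node has nowhere to send its score; it keeps it.
                new_scores[i] += scores[i]
                continue
            for j in range(n):
                new_scores[j] += scores[i] * weights[i][j] / out_sums[i]
        scores = new_scores

    # Keep the highest-scoring tokens, returned in their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores
```

Calling `prune_tokens(features, keep=k)` on a list of token feature vectors returns the indices to retain together with the propagated scores. Because the score depends only on graph structure and not on a foreground mask, a token from either foreground or background can end up among the kept ones, which matches the behaviour the abstract describes.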

Code & Models

Repositories

jytmelon/g-prune
PyTorch · Official

Videos

What kind of visual tokens do we need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph· underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training