EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs
Yuhao Chen, Bin Shan, Xin Ye, Cheng Chen

TL;DR
EvoPrune prunes visual tokens early, during visual encoding rather than after it, in multimodal large language models (MLLMs). This speeds up inference with minimal performance loss on vision-language tasks.
Contribution
The paper presents a layer-wise pruning strategy that operates during visual encoding. Prior methods mainly prune after encoding and so still pay the full computational cost of the encoding stage.
Findings
Achieves a 2x inference speedup on the VideoMME dataset
Maintains less than 1% performance degradation
Validates effectiveness across image and video benchmarks
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME…
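The layer-wise strategy described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the scoring rule, so the greedy criterion below (normalized attention importance minus a redundancy penalty based on cosine similarity) is an assumption, as are the function names `attention_importance`, `select_tokens`, and `encode_with_pruning`, the weight `lam`, the `tanh` stand-in for an encoder block, and the choice of pruning layers and keep ratio.

```python
import numpy as np

def attention_importance(tokens):
    """Mean attention each token receives under a toy single-head self-attention."""
    d = tokens.shape[1]
    logits = tokens @ tokens.T / np.sqrt(d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights.mean(axis=0)  # column mean = attention received per token

def select_tokens(tokens, importance, keep, lam=0.5):
    """Greedily keep `keep` tokens, trading importance against redundancy.

    Hypothetical criterion: score = importance - lam * (max cosine similarity
    to any token already kept). Higher lam favours diversity.
    """
    keep = min(keep, len(tokens))
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T  # pairwise cosine similarity
    span = importance.max() - importance.min()
    imp = (importance - importance.min()) / (span + 1e-8)  # rescale to [0, 1]

    selected = [int(np.argmax(imp))]       # start from the most important token
    max_sim = sim[selected[0]].copy()      # similarity to the closest kept token
    while len(selected) < keep:
        score = imp - lam * max_sim
        score[selected] = -np.inf          # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, sim[nxt])
    return np.sort(np.array(selected))     # preserve original token order

def encode_with_pruning(tokens, num_layers=4, prune_layers=(1, 3), keep_ratio=0.5):
    """Run a toy encoder and drop tokens at the selected layers."""
    for layer in range(num_layers):
        tokens = np.tanh(tokens)           # stand-in for a real transformer block
        if layer in prune_layers:
            keep = max(1, int(len(tokens) * keep_ratio))
            idx = select_tokens(tokens, attention_importance(tokens), keep)
            tokens = tokens[idx]
    return tokens
```

Because tokens are dropped inside the encoder loop, every later layer processes a shorter sequence. That is the source of the encoding-stage savings that post-encoding pruning cannot obtain.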
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
