EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Yuhao Chen; Bin Shan; Xin Ye; Cheng Chen

arXiv:2603.03681·cs.CV·March 5, 2026

TL;DR

EvoPrune prunes visual tokens early, during visual encoding rather than after it, in multimodal large language models (MLLMs). It reports a 2x inference speedup on VideoMME with less than 1% performance degradation.

Contribution

The paper presents a layer-wise pruning strategy that operates during visual encoding. Prior methods prune only after encoding and therefore still pay the full computational cost of the visual encoder.
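To make the difference between pruning during and after encoding concrete, here is a minimal sketch of how the token count entering each encoder layer changes under a layer-wise schedule. Every number in it (24 layers, 1024 tokens, pruning after layers 5, 11, and 17, a keep ratio of 0.5) is a hypothetical illustration, not a setting taken from the paper, and the cost model counts only terms that scale linearly with the number of tokens, so it ignores the quadratic part of attention.

```python
def tokens_per_layer(num_tokens, num_layers, prune_layers, keep_ratio):
    """Return the number of tokens entering each encoder layer.

    prune_layers: set of layer indices after which tokens are pruned.
    keep_ratio:   fraction of tokens kept at each pruning step.
    """
    counts = []
    n = num_tokens
    for layer in range(num_layers):
        counts.append(n)
        if layer in prune_layers:
            n = max(1, int(n * keep_ratio))
    return counts


def relative_encoder_cost(counts, baseline_tokens):
    """Encoder cost relative to an unpruned encoder (linear-in-tokens terms only)."""
    return sum(counts) / (baseline_tokens * len(counts))


# Hypothetical schedule: halve the tokens after layers 5, 11, and 17.
during = tokens_per_layer(1024, 24, {5, 11, 17}, 0.5)
# Pruning after encoding: the encoder itself still processes every token.
after = tokens_per_layer(1024, 24, set(), 0.5)

print(relative_encoder_cost(during, 1024))  # 0.46875
print(relative_encoder_cost(after, 1024))   # 1.0
```

Under these assumed settings the encoder does a little under half of the token-linear work, while a method that prunes only after encoding saves nothing inside the encoder.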

Findings

01  Achieves a 2x inference speedup on the VideoMME dataset

02  Maintains less than 1% performance degradation

03  Validates effectiveness across image and video benchmarks

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in vision-language tasks, but their inference efficiency is severely limited by the exponential growth of visual tokens in complex scenarios such as high-resolution images and videos. Existing visual token pruning methods mainly operate after visual encoding, overlooking the substantial computational cost incurred during the encoding stage. To address this issue, we propose EvoPrune, an early-stage visual token pruning method for MLLMs that performs pruning directly during visual encoding. Specifically, EvoPrune employs a layer-wise pruning strategy guided by token similarity, diversity, and attention-based importance to retain the most informative visual tokens at selected encoding layers. Extensive experiments on image and video benchmarks validate the effectiveness of EvoPrune. In particular, on the VideoMME…
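The abstract names three signals (token similarity, diversity, and attention-based importance) but the excerpt above does not say how they are combined. The sketch below therefore swaps in a generic greedy rule in the style of maximal marginal relevance: each step keeps the token with the highest importance after subtracting a penalty for its cosine similarity to tokens already kept. The function names, the `redundancy_weight` parameter, and the scoring formula are all assumptions made for illustration; this is not EvoPrune's actual criterion.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def select_tokens(features, importance, keep, redundancy_weight=0.5):
    """Greedily choose `keep` token indices (hypothetical scoring rule).

    features:   one feature vector per token.
    importance: one score per token, e.g. attention received from a query token.
    Each step picks the token maximizing
        importance[i] - redundancy_weight * (max similarity to kept tokens).
    """
    selected = []
    remaining = set(range(len(features)))
    max_sim = [0.0] * len(features)  # negative similarities are treated as 0
    while remaining and len(selected) < keep:
        best = max(
            sorted(remaining),
            key=lambda i: importance[i] - redundancy_weight * max_sim[i],
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            max_sim[i] = max(max_sim[i], cosine(features[i], features[best]))
    return sorted(selected)


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is distinct.
features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.5, 0.5]]
importance = [0.9, 0.85, 0.5, 0.3]

print(select_tokens(features, importance, keep=2))                         # [0, 2]
print(select_tokens(features, importance, keep=2, redundancy_weight=0.0))  # [0, 1]
```

With the redundancy penalty switched off, the rule reduces to plain top-k by importance and keeps both near-duplicate tokens. With it on, the second slot goes to the distinct token instead, which is the kind of diversity-aware behavior the abstract describes.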

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning