EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models

Wenhui Zhu; Xiwen Chen; Zhipeng Wang; Shao Tang; Sayan Ghosh; Xuanzhao Dong; Rajat Koner; Yalin Wang

arXiv:2508.11886·cs.CV·August 19, 2025

TL;DR

This paper introduces EVTP-IV, a visual token pruning method for Instructed Visual Segmentation (IVS) in multimodal large language models (MLLMs) that speeds up inference by up to 5X on video tasks while keeping accuracy comparable.

Contribution

The paper proposes a token pruning technique that improves inference efficiency on IVS tasks by selecting a compact, spatially representative subset of visual tokens, supported by an information-theoretic analysis.

Findings

01. Achieves up to 5X speed-up on video IVS tasks

02. Maintains comparable accuracy with only 20% of tokens

03. Outperforms existing pruning methods across benchmarks

Abstract

Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. We empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IV, which builds upon the k-center by integrating spatial information to ensure better coverage. We further provide an information-theoretic analysis to support our design. Experiments on standard IVS…
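To make the idea of a k-center selection with spatial information concrete, here is a minimal sketch of greedy k-center (farthest-point) selection over a grid of visual tokens. It is not the authors' implementation: the function name `prune_tokens`, the parameters `grid_w`, `keep_ratio`, and `spatial_weight`, and the choice to add a weighted grid-position distance to the feature distance are all assumptions made for illustration.

```python
import math


def prune_tokens(features, grid_w, keep_ratio=0.2, spatial_weight=1.0):
    """Greedy k-center selection over visual tokens (illustrative only).

    features: list of per-token feature vectors in row-major grid order.
    grid_w: number of tokens per grid row (hypothetical layout assumption).
    Returns the sorted indices of the tokens to keep.
    """
    n = len(features)
    k = min(n, max(1, round(n * keep_ratio)))
    grid_h = math.ceil(n / grid_w)

    # Normalised (row, col) position of every token on the grid.
    pos = [
        ((i // grid_w) / max(grid_h - 1, 1), (i % grid_w) / max(grid_w - 1, 1))
        for i in range(n)
    ]

    def dist(a, b):
        # Assumed combination: feature distance plus weighted spatial distance.
        feat = math.dist(features[a], features[b])
        spat = math.dist(pos[a], pos[b])
        return feat + spatial_weight * spat

    # Start from token 0, then repeatedly add the token that is farthest
    # from its nearest already-selected center.
    selected = [0]
    min_dist = [dist(i, 0) for i in range(n)]
    min_dist[0] = -1.0  # mark as selected so it is never picked again
    while len(selected) < k:
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = -1.0
        for i in range(n):
            d = dist(i, nxt)
            if d < min_dist[i]:
                min_dist[i] = d
    return sorted(selected)


# Usage: keep 4 of 16 tokens on a 4x4 grid with identical features,
# so only the spatial term drives the selection.
kept = prune_tokens([[0.0]] * 16, grid_w=4, keep_ratio=0.25)
print(kept)
```

With identical features the selection falls back to pure spatial spread, which illustrates the coverage intuition from the abstract. With real features, `spatial_weight` would trade off feature diversity against spatial coverage; its value here is arbitrary.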

Taxonomy

Topics: Multimodal Machine Learning Applications