MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs

Sixun Dong; Juhua Hu; Mian Zhang; Ming Yin; Yanjie Fu; Qi Qian

arXiv:2508.18264·cs.CV·March 4, 2026

3 Reviews

TL;DR

This paper introduces MMTok, a multimodal token selection method that improves the inference efficiency of vision-language models by using both vision and text tokens to select informative vision tokens, achieving a 1.87x speedup with 98.7% performance retention on LLaVA-NeXT-13B.

Contribution

The paper proposes a novel multimodal coverage maximization approach for token pruning, addressing the limitations of unimodal methods and demonstrating superior efficiency and performance on benchmark datasets.

Findings

01

Achieves 1.87x speedup with 98.7% performance retention on LLaVA-NeXT-13B.

02

Preserves 87.7% of original performance with only four vision tokens.

03

Multimodal token selection outperforms unimodal baselines.

Abstract

Vision-Language Models (VLMs) demonstrate impressive performance in understanding visual content with language instruction by converting visual inputs to vision tokens. However, redundancy in vision tokens degrades the inference efficiency of VLMs. While many algorithms have been proposed to reduce the number of vision tokens, most of them apply only unimodal information (i.e., vision or text) for pruning and ignore the inherent multimodal property of vision-language tasks. Moreover, there is no generic criterion that can be applied to different modalities. To mitigate this limitation, in this work, we propose to leverage both vision and text tokens to select informative vision tokens by the coverage criterion. We first formulate the subset selection problem as a maximum coverage problem. Afterwards, a subset of vision tokens is optimized to cover the text tokens and the original…
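To make the coverage idea in the abstract concrete, here is a minimal sketch of greedy subset selection under a coverage-style objective. It is an illustration under stated assumptions, not the paper's implementation: the function name `greedy_coverage_select`, the weight `alpha`, the use of clipped cosine similarity, and the max-similarity (facility-location style) objective are all hypothetical choices made for this example. The sketch picks vision tokens so that each text token and each original vision token has a similar token in the selected subset.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def greedy_coverage_select(vision, text, k, alpha=1.0):
    """Greedily pick up to k vision-token indices.

    Assumed objective (illustrative only):
        f(S) = sum_t max_{s in S} sim(t, s) + alpha * sum_v max_{s in S} sim(v, s)
    where t ranges over text tokens and v over all vision tokens.
    """
    # Targets to be covered: text tokens (weight 1) and vision tokens (weight alpha).
    targets = [(t, 1.0) for t in text] + [(v, alpha) for v in vision]
    # sim[j][i]: weighted, non-negative similarity of target j to candidate i.
    sim = [[w * max(cosine(t, v), 0.0) for v in vision] for t, w in targets]

    covered = [0.0] * len(targets)  # best similarity achieved so far per target
    selected = []
    for _ in range(min(k, len(vision))):
        best_i, best_gain = None, -1.0
        for i in range(len(vision)):
            if i in selected:
                continue
            # Marginal gain: how much candidate i improves total coverage.
            gain = sum(max(sim[j][i] - covered[j], 0.0) for j in range(len(targets)))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        covered = [max(covered[j], sim[j][best_i]) for j in range(len(targets))]
    return selected


# Toy usage: tokens 0 and 1 are near-duplicates, token 2 is distinct.
vision_tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
text_tokens = [[1.0, 0.0]]
print(greedy_coverage_select(vision_tokens, text_tokens, k=2))
```

In this toy input, once one of the two near-duplicate tokens is chosen, the other adds almost no marginal coverage, so the distinct token is preferred for the second slot. Greedy selection is a common choice for objectives of this kind because they are monotone submodular, though whether it matches the optimizer used in the paper is not stated in the text above.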

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

- The paper addresses an important and practical problem: improving inference efficiency of VLMs without fine-tuning. The multimodal coverage formulation is well-motivated, combining both text-vision and vision-vision similarity in a principled manner.
- Extensive experiments across multiple models (LLaVA-1.5, LLaVA-NeXT, Qwen) and datasets demonstrate the generality and effectiveness of the method.
- The method achieves impressive results, e.g., 1.87× speedup on LLaVA-NeXT-13B with 98.7%

Weaknesses

- My main concern lies in the efficiency evaluation. The authors only provide an efficiency test on the base version. The agent-based text enrichment is presented as optional and shows mixed results, but its computational overhead and when it is most beneficial are not thoroughly discussed.
- What is the efficiency cost under the practical experimental settings, e.g., the experiments in Table 3 with Qwen2.5-VL?
- How does the time complexity of MMTok scale with the number of vision tokens, especially compared

Reviewer 02·Rating 4·Confidence 4

Strengths

1. **Reasonable idea.** Using cross-modal correlation for visual token compression is reasonable.
2. **Solid experiments.** MMTok works with the LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL baselines, including cutting-edge SOTA VLMs such as Qwen2.5-VL. This demonstrates the robustness and effectiveness of the method.
3. **Training-free.** MMTok is training-free and plug-and-play, so it is easy to apply to multiple foundation VLMs.

Weaknesses

1. **Compare with resize baseline.** Experimental results in [1] show that simply resizing the raw image yields good performance and low latency. I would like the authors to compare their approach with the simple resize approach and add a discussion to the paper.
2. **Work on multi-turn conversation.** MMTok relies on the text instruction to prune vision tokens and is therefore hard to apply to multi-turn conversation. The authors should discuss this situation and propose some solutions.
3. **

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Formulating vision token pruning as a maximum coverage problem is both novel and insightful. This formulation naturally captures visual–textual interactions by optimizing vision tokens to jointly cover textual semantics and the original visual space, providing a solid mathematical foundation for multimodal token selection.
2. The authors conduct impressive experiments on five models and demonstrate improvements over previous methods. Moreover, the ablation study is detailed, particularly in

Weaknesses

1. The proposed VLM agent offers limited contribution. As shown in Table 1, it yields at most a 0.2 improvement and even leads to performance degradation in some cases, while introducing additional time overhead.
2. Compared with finetuned VisionZip, MMTok performs notably better under the 64-token setting. However, its advantage diminishes under the 128- and 192-token settings. I suggest evaluating MMTok on more challenging benchmarks, such as MMStar and MathVista, to better highlight its stre
