Parallel Vision Token Scheduling for Fast and Accurate Multimodal LMMs Inference

Wengyi Zhan; Mingbao Lin; Zhihang Lin; Rongrong Ji

arXiv:2511.18875 · cs.CV · November 25, 2025

TL;DR

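ParVTS is a training-free token scheduling method that partitions visual tokens into subject and non-subject groups, processes the two groups in parallel, and drops the non-subject path mid-inference. It cuts inference latency in multimodal large language models (a reported 1.77x speedup) with minimal accuracy loss.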

Contribution

The paper introduces ParVTS, a training-free token scheduling framework that prunes visual tokens in multimodal large language models without additional modules or heuristics.

Findings

01. Prunes up to 88.9% of visual tokens with minimal accuracy loss.

02. Achieves 1.77x speedup and 70% FLOPs reduction in inference.

03. Compatible with various MLLM architectures.
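
To give a feel for what an 88.9% pruning ratio means, here is a minimal arithmetic sketch. The counts used in the usage note (576 visual tokens, 64 text tokens) are illustrative assumptions and do not come from the paper. The helper names `retained_tokens` and `attention_pair_ratio` are hypothetical. The second helper only models the quadratic attention term over the full sequence, so it is not expected to reproduce the reported 70% FLOPs reduction, which also depends on the layer at which tokens are dropped and on the non-attention parts of the model.

```python
def retained_tokens(num_visual_tokens: int, prune_ratio: float) -> int:
    """Number of visual tokens left after pruning the given fraction."""
    return round(num_visual_tokens * (1.0 - prune_ratio))


def attention_pair_ratio(num_text_tokens: int,
                         num_visual_tokens: int,
                         prune_ratio: float) -> float:
    """Ratio of token-pair counts (sequence length squared) after vs. before pruning.

    This models only the quadratic self-attention term; it ignores feed-forward
    layers and assumes pruning applies from the first layer, which is a
    simplification for illustration.
    """
    before = num_text_tokens + num_visual_tokens
    after = num_text_tokens + retained_tokens(num_visual_tokens, prune_ratio)
    return (after / before) ** 2


if __name__ == "__main__":
    # Illustrative, assumed sizes: 576 visual tokens and 64 text tokens.
    print(retained_tokens(576, 0.889))
    print(attention_pair_ratio(64, 576, 0.889))
```

Under these assumed sizes, pruning 88.9% of 576 visual tokens leaves 64, and the sequence shrinks from 640 to 128 tokens, so the attention pair count falls to about 4% of its original value.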

Abstract

Multimodal large language models (MLLMs) deliver impressive vision-language reasoning but suffer steep inference latency because self-attention scales quadratically with sequence length and high-resolution images contribute thousands of visual tokens. Naively pruning less-informative visual tokens reduces this burden, yet indiscriminate removal can strip away contextual cues essential for background or fine-grained questions, undermining accuracy. In this paper, we present ParVTS (Parallel Vision Token Scheduling), a training-free scheduling framework that partitions visual tokens into subject and non-subject groups, processes them in parallel to transfer their semantics into question tokens, and discards the non-subject path mid-inference to reduce computation. This scheduling reduces computational complexity, requires no heuristics or additional modules, and is compatible with…
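
The scheduling idea in the abstract (partition, process both groups in parallel, then drop the non-subject path) can be illustrated with a toy sketch. This is not the authors' implementation, and several pieces are assumptions made only so the example runs: the saliency `scores` used for partitioning, the `mean_mix_layer` stand-in for a transformer layer, the `drop_layer` parameter, and in particular the step that averages the two paths' question-token states when the non-subject path is discarded. The abstract does not specify how the paths are combined.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def partition_by_score(scores: Sequence[float],
                       num_subject: int) -> Tuple[List[int], List[int]]:
    """Split visual-token indices into (subject, non_subject) groups.

    ASSUMPTION: tokens are ranked by a hypothetical saliency score.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_subject]), sorted(ranked[num_subject:])


def mean_mix_layer(tokens: List[Vector]) -> List[Vector]:
    """Toy stand-in for a transformer layer: move each token halfway to the mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def run_scheduled(visual: List[Vector], question: List[Vector],
                  scores: Sequence[float], num_subject: int,
                  layers: List[Layer], drop_layer: int) -> List[Vector]:
    """Run two token paths up to drop_layer, then continue with the subject path."""
    subject_idx, other_idx = partition_by_score(scores, num_subject)

    # Each path carries its own copy of the question tokens.
    subject_path = [visual[i] for i in subject_idx] + [list(q) for q in question]
    other_path = [visual[i] for i in other_idx] + [list(q) for q in question]

    # Early layers: the paths are independent, so a real system could batch them.
    for layer in layers[:drop_layer]:
        subject_path = layer(subject_path)
        other_path = layer(other_path)

    # ASSUMPTION: keep the non-subject semantics by averaging the question-token
    # states of both paths before discarding the non-subject visual tokens.
    q_subject = subject_path[len(subject_idx):]
    q_other = other_path[len(other_idx):]
    merged_q = [[(a + b) / 2.0 for a, b in zip(qs, qo)]
                for qs, qo in zip(q_subject, q_other)]
    sequence = subject_path[:len(subject_idx)] + merged_q

    # Remaining layers see only subject visual tokens plus question tokens.
    for layer in layers[drop_layer:]:
        sequence = layer(sequence)
    return sequence
```

In this sketch the later layers operate on a shorter sequence (subject tokens plus question tokens), which is where the computational saving would come from; the early layers still touch every visual token once, split across two shorter sequences.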

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling