RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Jingqi Xu; Jingxi Lu; Chenghao Li; Sreetama Sarkar; Souvik Kundu; Peter A. Beerel

arXiv:2511.12428 · cs.CV · November 18, 2025

Open Access

TL;DR

RedVTP is a novel training-free method that accelerates diffusion vision-language model inference by pruning visual tokens based on response-driven importance, significantly improving efficiency without sacrificing accuracy.

Contribution

The paper introduces a response-driven visual token pruning strategy for DVLMs that leverages their inference dynamics to improve efficiency without additional training.

Findings

1. Up to 186% increase in token generation throughput

2. Up to 64.97% reduction in inference latency

3. Maintains or improves model accuracy
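
To read the headline percentages as speedup factors, a small arithmetic sketch may help. The function names are illustrative and not from the paper, and the two "up to" figures may come from different models or benchmarks, so the resulting factors should not be assumed to describe the same run.

```python
def throughput_multiplier(percent_increase: float) -> float:
    """A P% increase in throughput means (1 + P/100) times the baseline."""
    return 1.0 + percent_increase / 100.0


def latency_speedup(percent_reduction: float) -> float:
    """A P% latency reduction leaves (1 - P/100) of the baseline time,
    so the speedup factor is the reciprocal of that remaining fraction."""
    remaining = 1.0 - percent_reduction / 100.0
    if remaining <= 0.0:
        raise ValueError("latency reduction must be below 100%")
    return 1.0 / remaining


# Figures quoted in the findings above.
print(round(throughput_multiplier(186.0), 2))  # 2.86
print(round(latency_speedup(64.97), 2))        # 2.85
```

Under this reading, a 186% throughput increase is about 2.86 times the baseline throughput, and a 64.97% latency reduction corresponds to roughly a 2.85 times speedup.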

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive because they enable parallel token decoding, but the large number of visual tokens still significantly hinders their inference efficiency. While visual token pruning has been extensively studied for autoregressive VLMs (AVLMs), it remains largely unexplored for DVLMs. In this work, we propose RedVTP, a response-driven visual token pruning strategy that leverages the inference dynamics of DVLMs. Our method estimates visual token importance using attention from the masked response tokens. Based on the observation that these importance scores remain consistent across steps, RedVTP prunes the less important visual tokens from the masked tokens after the first…
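
The scoring idea in the abstract (rank visual tokens by the attention they receive from masked response tokens, then drop the low-ranked ones) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: averaging over heads and response tokens, the `keep_ratio` parameter, and both function names are assumptions. The abstract is cut off before it says exactly when pruning is applied, so the sketch leaves that decision to the caller. It uses plain Python lists for readability; a real model would use tensor operations.

```python
from typing import List, Sequence


def visual_token_scores(attn: Sequence[Sequence[float]]) -> List[float]:
    """Average attention each visual token receives from masked response tokens.

    attn[i][j] is the attention weight from masked response token i to
    visual token j, assumed here to be already averaged over heads.
    """
    if not attn:
        raise ValueError("need at least one masked response token")
    num_visual = len(attn[0])
    return [sum(row[j] for row in attn) / len(attn) for j in range(num_visual)]


def select_visual_tokens(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring visual tokens, in original order.

    keep_ratio is a hypothetical knob; the abstract does not give a value.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_keep])


# Toy example: 2 masked response tokens attending to 4 visual tokens.
attn = [
    [0.05, 0.40, 0.10, 0.30],
    [0.10, 0.35, 0.05, 0.25],
]
scores = visual_token_scores(attn)
kept = select_visual_tokens(scores, keep_ratio=0.5)
print(kept)  # [1, 3]
```

Returning the kept indices in their original order preserves the spatial ordering of the remaining visual tokens, which matters if positional information is tied to token position.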

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis