A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models

Duo Li; Zuhao Yang; Xiaoqin Zhang; Ling Shao; Shijian Lu

arXiv:2511.15098·cs.CV·November 20, 2025

Open Access

TL;DR

This paper investigates visual token redundancy in discrete diffusion-based multimodal large language models, revealing how redundancy varies with architecture and tasks, and exploring pruning strategies to improve efficiency without significant information loss.

Contribution

It provides a comprehensive analysis of visual token redundancy in dMLLMs, highlighting the conditions under which redundancy emerges and how different pruning methods impact model performance and efficiency.

Findings

01. Redundancy emerges only in from-scratch dMLLMs handling long-answer tasks.

02. Token pruning causes information loss, but the loss can be mitigated in from-scratch models.

03. Layer-skipping accelerates AR-to-diffusion models, while pruning works best when applied at late denoising steps.

Abstract

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling. However, most existing dMLLMs incur significant computational overhead during inference due to the full-sequence attention computation in each denoising step. Pioneering studies attempt to resolve this issue from a modality-agnostic perspective via key-value cache optimization or efficient sampling, but most of them overlook modality-specific visual token redundancy. In this work, we conduct a comprehensive study on how visual token redundancy evolves with different dMLLM architectures and tasks, and how visual token pruning affects dMLLM responses and efficiency. Specifically, our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer…

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications