D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

Shuochen Chang; Xiaofeng Zhang; Qingyang Liu; Li Niu

arXiv:2511.12280·cs.CV·November 18, 2025

TL;DR

D$^{3}$ToM introduces a dynamic token merging technique guided by decider tokens to accelerate diffusion-based multimodal large language models, maintaining performance while reducing computational complexity during inference.

Contribution

The paper presents a plug-and-play module that dynamically merges visual tokens in diffusion MLLMs, speeding up inference without retraining the entire model.

Findings

1. Accelerates inference in diffusion MLLMs by merging tokens dynamically.
2. Maintains competitive performance with reduced computational cost.
3. Integrates seamlessly into existing transformer layers.

Abstract

Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. Then it maintains a proportion of the most…
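
The decider-guided step described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the abstract is truncated before it specifies the selection and merging rule, so everything past "build an importance map" is an assumption. In particular, scoring visual tokens by the mean softmax attention they receive from decider tokens, keeping a fixed `keep_ratio`, and averaging each remaining token into its most cosine-similar kept token are all hypothetical choices, as are the function names `importance_map` and `merge_visual_tokens`.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def importance_map(decider, visual):
    """Assumed scoring: mean attention each visual token receives
    from the decider tokens (scaled dot-product attention)."""
    scale = math.sqrt(len(visual[0]))
    scores = [0.0] * len(visual)
    for q in decider:
        weights = softmax([dot(q, k) / scale for k in visual])
        for i, w in enumerate(weights):
            scores[i] += w / len(decider)
    return scores


def merge_visual_tokens(decider, visual, keep_ratio):
    """Keep the top-scoring visual tokens and fold every other token
    into its most similar kept token by averaging (assumed rule)."""
    scores = importance_map(decider, visual)
    n_keep = max(1, int(keep_ratio * len(visual)))
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve original token order
    groups = {i: [visual[i]] for i in kept}
    for j in ranked[n_keep:]:
        target = max(kept, key=lambda i: cosine(visual[i], visual[j]))
        groups[target].append(visual[j])
    dim = len(visual[0])
    merged = [
        [sum(v[d] for v in groups[i]) / len(groups[i]) for d in range(dim)]
        for i in kept
    ]
    return merged, kept


# Toy example: four 2-d visual tokens, two decider tokens.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
decider = [[5.0, 0.0], [0.0, 5.0]]
merged, kept = merge_visual_tokens(decider, visual, keep_ratio=0.5)
print(kept, merged)
```

Because the decider tokens change from one denoising step to the next, a routine like this would be re-run at each step, which is what makes the merging dynamic; the shortened visual sequence is what reduces the cost of the subsequent self-attention.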

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications