D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs
Shuochen Chang, Xiaofeng Zhang, Qingyang Liu, Li Niu

TL;DR
D$^{3}$ToM introduces a dynamic token merging technique guided by decider tokens to accelerate diffusion-based multimodal large language models, maintaining performance while reducing computational complexity during inference.
Contribution
It presents a novel plug-and-play module that dynamically merges visual tokens in diffusion MLLMs, significantly speeding up inference without retraining the entire model.
Findings
Accelerates inference in diffusion MLLMs by merging tokens dynamically.
Maintains competitive performance with reduced computational cost.
Integrates seamlessly into existing transformer layers.
Abstract
Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that dynamically merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. Then it maintains a proportion of the most…
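The merging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated before the merging rule is stated, so everything below is an assumption made for illustration. The function name `merge_visual_tokens`, the `keep_ratio` parameter, the use of softmax attention from decider tokens to visual tokens as the importance score, and the rule of averaging each low-importance token into its most cosine-similar kept token are all hypothetical choices.

```python
import numpy as np


def merge_visual_tokens(visual, decider, keep_ratio=0.5):
    """Illustrative decider-guided token merging (hypothetical sketch).

    visual:  (N, d) array of visual token features.
    decider: (M, d) array of features for tokens generated in the
             previous denoising step.
    Returns (merged, keep_idx): merged is (K, d), keep_idx is (K,).
    """
    n, d = visual.shape
    k = max(1, int(round(n * keep_ratio)))

    # Importance map (assumed form): softmax attention from each decider
    # token to every visual token, averaged over the decider tokens.
    logits = decider @ visual.T / np.sqrt(d)            # (M, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    importance = attn.mean(axis=0)                      # (N,)

    # Keep the K most important tokens, preserving their original order.
    order = np.argsort(-importance, kind="stable")
    keep_idx = np.sort(order[:k])
    drop_idx = np.sort(order[k:])
    kept = visual[keep_idx].copy()
    if drop_idx.size == 0:
        return kept, keep_idx

    # Merge rule (assumed): average each dropped token into the kept token
    # it is most similar to under cosine similarity.
    def unit(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sim = unit(visual[drop_idx]) @ unit(kept).T         # (N-K, K)
    target = sim.argmax(axis=1)                         # (N-K,)

    sums = kept.copy()
    counts = np.ones(k)
    np.add.at(sums, target, visual[drop_idx])           # unbuffered scatter-add
    np.add.at(counts, target, 1.0)
    return sums / counts[:, None], keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(16, 8))    # 16 visual tokens, 8 dims
    decider = rng.normal(size=(4, 8))    # 4 decider tokens
    merged, keep_idx = merge_visual_tokens(visual, decider, keep_ratio=0.25)
    print(merged.shape, keep_idx)
```

Under these assumptions, because the importance map is recomputed from the current decider tokens, the kept set can differ from one denoising step to the next. Reducing N visual tokens to K shrinks the number of token pairs that bidirectional self-attention must score at that step, which is where the speedup would come from.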
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
