Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding
Runpeng Yu, Xinyin Ma, Xinchao Wang

TL;DR
Dimple introduces a novel discrete diffusion multimodal large language model that combines autoregressive and diffusion training phases, achieving high performance, improved inference efficiency, and precise response control.
Contribution
The paper presents the first DMLLM with a combined training paradigm, a confident decoding strategy, and structured response control, advancing multimodal large language model capabilities.
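As a rough illustration of structured response control, the sketch below pins chosen tokens at fixed positions of an otherwise fully masked response, so that a diffusion decoder would only fill the remaining slots. The `MASK` string and the helpers `init_response` and `editable_positions` are hypothetical names introduced here for illustration. They are not taken from the paper or its code, and the paper's actual mechanism may differ.

```python
MASK = "<mask>"  # hypothetical mask token, assumed for this sketch


def init_response(length, prior):
    """Build a fully masked response, then pin tokens at chosen positions.

    `prior` maps position -> token. Pinned positions are treated as fixed,
    so the decoder is only asked to fill the positions that remain masked.
    """
    seq = [MASK] * length
    for pos, tok in prior.items():
        if not 0 <= pos < length:
            raise ValueError(f"position {pos} outside response of length {length}")
        seq[pos] = tok
    return seq


def editable_positions(seq):
    """Return the positions the decoder is still allowed to fill."""
    return [i for i, tok in enumerate(seq) if tok == MASK]


# Enforce an "Answer: ... ." format on an 8-token response.
template = init_response(8, {0: "Answer", 1: ":", 7: "."})
free_slots = editable_positions(template)
```

Because every position exists from the start, a format constraint can be placed anywhere in the response, including its end, which a left-to-right decoder cannot condition on directly.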
Findings
Dimple-7B surpasses LLaVA-NEXT by 3.9% in performance.
Confident decoding reduces the number of generation iterations to roughly one-third of the response length.
Prefilling technique speeds up inference by 1.5x to 7x.
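The idea of letting the number of tokens decoded per step vary can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes a confidence threshold decides which masked positions are committed at each step, with a fallback to the single most confident position so that decoding always progresses. The function `confident_decode`, the `threshold` parameter, and the toy predictor are hypothetical and stand in for a real model; they are not the paper's implementation.

```python
MASK = None  # placeholder for a still-masked position (assumption of this sketch)


def confident_decode(predict, length, threshold=0.9):
    """Iteratively unmask a response, committing a variable number of tokens per step.

    `predict(seq)` must return {position: (token, confidence)} for every
    masked position in `seq`. Returns the decoded sequence and the step count.
    """
    seq = [MASK] * length
    steps = 0
    while any(tok is MASK for tok in seq):
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        preds = predict(seq)
        # Commit every masked position whose confidence clears the threshold.
        chosen = [i for i in masked if preds[i][1] >= threshold]
        if not chosen:
            # Fallback: commit the single most confident position.
            chosen = [max(masked, key=lambda i: preds[i][1])]
        for i in chosen:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps


def toy_predict(seq):
    """Toy stand-in for a model: confident at even positions, or once the left neighbour is filled."""
    out = {}
    for i, tok in enumerate(seq):
        if tok is MASK:
            sure = i % 2 == 0 or (i > 0 and seq[i - 1] is not MASK)
            out[i] = (f"t{i}", 0.95 if sure else 0.5)
    return out


decoded, n_steps = confident_decode(toy_predict, 6)
```

With this toy predictor, six tokens are decoded in two steps at the default threshold, while a threshold that nothing clears degrades to one token per step. A fixed-K schedule would instead commit the same number of tokens at every step regardless of confidence.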
Abstract
In this work, we propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We observe that training with a purely discrete diffusion approach leads to significant training instability, suboptimal performance, and severe length bias issues. To address these challenges, we design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. This approach yields the Dimple-7B model, trained on the same dataset and using a similar training pipeline as LLaVA-NEXT. Dimple-7B ultimately surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models. To improve inference efficiency, we propose a decoding strategy termed confident decoding, which dynamically adjusts the number of tokens generated at each step, significantly reducing the number of…
Peer Reviews
Decision: Submitted to ICLR 2026
The strengths of the paper are:
1. Originality: Proposes a hybrid training paradigm (AR pretraining + diffusion) for dMLLMs, which is empirically effective. Introduces confident decoding and structure prior, enabling dynamic parallel decoding and fine-grained output control, capabilities not available in AR models.
2. Clarity: Clear motivation, methodology, and results presentation.
3. Significance: The proposed techniques (confident decoding, structure prior) are likely to inspire further rese
The weaknesses of the paper are:
1. Novelty: The proposed ideas are engineering-centric. I believe confident decoding with a flexible number of steps has already been proposed in the literature. I do not rate the overall novelty of the paper highly, though it is acceptable.
2. Ablation on Structure Prior: While qualitative examples are provided, a more systematic quantitative evaluation of the structure prior's impact would strengthen the claims.
3. Lack of comparisons and experimental details.
- Two pragmatic routes to dMLLMs: AR->diffusion (Dimple) and AR‑MLLM initialization->diffusion (Dimple+). Confident Decoding adaptively sets the number of updated tokens per step, unlike fixed‑K schedules in prior work. Structure Prior gives direct positional control, early answering and enforced formats, difficult for AR models.
- Competitive accuracy vs. matched AR baselines on 12 benchmarks and clear SOTA among dMLLMs with far fewer training samples than LLaDA‑V. Ablations quantify Prefillin
- While Dimple+ is SOTA within discrete diffusion MLLMs, it lags a strong AR MLLM (Qwen2.5‑VL‑7B) on several tasks (e.g., ChartQA 74.7 vs. 87.3; OCRBench 699 vs. 783; Table 1), making the overall value proposition partly about parity plus speed/controllability, not raw accuracy.
- The paper claims parity under the same budget, but AR vs. diffusion differ in token supervision density and FLOPs/step; matching only by iterations or tokens may not equate compute. A FLOPs‑normalized comparison would
1. The authors explore the diffusion paradigm for MLLMs and investigate it in both pure-LLM and MLLM settings.
2. A confident decoding strategy is proposed to accelerate MLLM inference, achieving 2x-6x speedups.
1. The experimental design and results do not adequately support the arguments. For instance, compared to the default Qwen2.5-VL-7B, Dimple+ shows obvious performance drops on multiple benchmarks. Moreover, the results of the Dimple-7B-AR baseline are also questionable. Compared to LLaVA-NEXT, which uses a much weaker LLM, the Dimple-AR baseline performs worse on multiple benchmarks. These results are quite unreasonable. If the authors want to prove the merits of diffusion, they can make comparisons to o
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Diffusion
