Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing
Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng

TL;DR
This paper introduces discrete diffusion forcing (D2F), a method that lets diffusion-based large language models run inference faster than autoregressive models by combining block-wise autoregressive generation with inter-block parallel decoding.
Contribution
The paper proposes discrete diffusion forcing (D2F), a simple strategy that turns diffusion LLMs into an AR-diffusion hybrid capable of faster inference than autoregressive models.
Findings
D2F achieves over 2.5x inference speedup over LLaMA3 and Qwen2.5.
D2F provides more than 50x acceleration over vanilla diffusion LLMs like LLaDA and Dream.
The method maintains comparable output quality while significantly increasing inference speed.
Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier based on a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation to enable KV cache utilization; (2) prediction of following tokens without requiring completion of prior blocks for inter-block parallel decoding. In this way, the vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel…
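As a rough illustration of capability (1), block-wise autoregressive generation is commonly realized with a block-causal attention mask: a position attends bidirectionally inside its own block and causally to all earlier blocks, so the keys and values of earlier blocks do not depend on later ones and can be cached. The sketch below is a generic construction, not code from the paper; the function name `block_causal_mask` and the parameters `seq_len` and `block_size` are assumptions made for illustration.

```python
def block_causal_mask(seq_len, block_size):
    """Build a block-causal attention mask as nested lists of booleans.

    mask[i][j] is True when query position i may attend to key position j.
    Illustrative only: names and layout are assumptions, not the paper's code.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        query_block = i // block_size
        # Allow every key whose block index is not later than the query's block.
        mask.append([(j // block_size) <= query_block for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print the mask as rows of 0/1 for a 6-token sequence split into blocks of 2.
    for row in block_causal_mask(seq_len=6, block_size=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

With `seq_len=6` and `block_size=2`, the first two rows allow only the first two keys, while the last two rows allow all six, which is the pattern that lets earlier blocks be reused from a cache.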
Peer Reviews
Decision: ICLR 2026 Poster
1) The work introduces an original hybrid paradigm that enables KV‑cache friendly, AR‑style generation while retaining dLLMs’ cross‑block parallelism, and it provides a tailored asymmetric distillation procedure to train the model. 2) According to the paper, D2F yields the first open‑source dLLMs that surpass state‑of‑the‑art AR LLMs in inference speed, and achieves more than 10× speedup on some benchmarks over dLLM baselines without D2F. 3) The experimental study is comprehensive, with comparis
1) Training relies on a pretrained dLLM as the teacher, which may limit scaling to stronger future D2F variants. In addition, D2F does not accelerate training, so the compute cost remains substantial. 2) The main figure (Figure 3) is information‑sparse; a clearer depiction of the asymmetric distillation would improve readability. Table 3 could be half‑width, since the current layout leaves excessive white space.
1. The paper tackles a critical problem in discrete diffusion models: inference acceleration. D2F unlocks parallelization even across different blocks, which is a very important feature for significantly enhancing speed. 2. The method includes distillation training for D2F and a customized inference procedure. The distillation helps mitigate the bias in block diffusion that requires the previous blocks to be fully decoded. 3. The presentation is clear and the proposed approach is clean and v
1. The technical novelty is somewhat bounded, since the work adapts the previous diffusion forcing literature (especially video diffusion) to discrete diffusion models. The concern is not significant, though, given the promising empirical performance. 2. Several experimental results that are key to comparing the accuracy-efficiency frontier and understanding the design of the proposed approach are missing. See Q1 and Q2 for more details.
1. Significant Milestone: The paper achieves "faster-than-AR" inference with an open-source dLLM, a significant milestone. The reported speedups (2.5x vs. LLaMA3, 50x vs. LLaDA) are extremely impressive. 2. Effective Training Strategy: The "asymmetric distillation" with a "monotonically increasing mask schedule" is a clever adaptation of Diffusion Forcing to the discrete domain. It effectively trains the model to predict from an incomplete prefix, enabling the parallel pipeline. 3. Solves the
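The "monotonically increasing mask schedule" mentioned in this review can be pictured with a small sketch: each block gets its own mask ratio, and later blocks are masked at least as heavily as earlier ones, so the model sees a partially revealed prefix when predicting later blocks. How the paper actually samples the ratios is not stated in this summary; sorting uniform draws, the `[MASK]` token string, and all function names below are assumptions for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token string, chosen for illustration


def monotone_mask_ratios(num_blocks, rng):
    """Sample one mask ratio per block, sorted so later blocks are masked at least as heavily.

    Assumption: sorted uniform draws stand in for whatever schedule the paper uses.
    """
    return sorted(rng.random() for _ in range(num_blocks))


def mask_blocks(tokens, block_size, ratios, rng):
    """Mask each token independently with the ratio assigned to its block."""
    noisy = []
    for i, tok in enumerate(tokens):
        ratio = ratios[i // block_size]
        noisy.append(MASK if rng.random() < ratio else tok)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same masking
    tokens = [f"t{i}" for i in range(8)]
    ratios = monotone_mask_ratios(num_blocks=4, rng=rng)
    print(ratios)
    print(mask_blocks(tokens, block_size=2, ratios=ratios, rng=rng))
```

The caller is responsible for passing one ratio per block, that is, at least `ceil(len(tokens) / block_size)` values.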
1. Inconsistent/Confusing Performance Claims: The paper's headline claim of "faster-than-AR" performance is made confusing by seemingly inconsistent numbers across the text and figures. This makes the exact performance trade-off difficult to assess.
- LLaMA3 Baseline: In Figure 2, the LLaMA3-Instruct-8B baseline (star) is plotted with a GSM8K score of ~77. However, the text in Section 5.3 states its score is 70.1. This is a significant discrepancy.
- D2F Performance: The paper reports multiple
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Machine Learning in Healthcare
