ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li

TL;DR
ReFusion is a novel masked diffusion model that enhances parallel decoding in large language models by integrating sequence reorganization, leading to significant speedups and performance improvements over prior diffusion models while approaching autoregressive model quality.
Contribution
ReFusion introduces a slot-level diffusion approach with sequence reorganization, enabling full KV cache reuse and reducing learning complexity, thus improving speed and performance of language models.
Findings
34% performance gain over prior diffusion models
Over 18× average speedup compared to previous MDMs
Bridges the performance gap to autoregressive models
Abstract
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that integrates sequence reorganization into the causal attention framework. By elevating parallel decoding from the token level to a higher slot level, ReFusion interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, while reordering newly generated slots ahead of the remaining masks after each iteration. Consequently, this design simultaneously unlocks full KV cache reuse and reduces learning complexity…
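The decoding loop described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the function `refusion_style_decode`, the callbacks `score_slot` and `fill_slot`, the confidence-threshold selection rule, and the toy scorer and filler are all hypothetical stand-ins for a real model. The sketch only illustrates the control flow: select several masked slots in parallel, infill each one left to right, and append the finished slots to a growing prefix that sits ahead of the remaining masks, so that prefix never changes and could in principle be cached.

```python
from typing import Callable, List, Tuple

Slot = Tuple[int, List[str]]  # (original slot position, generated tokens)


def refusion_style_decode(
    num_slots: int,
    slot_len: int,
    score_slot: Callable[[List[Slot], int], float],
    fill_slot: Callable[[List[Slot], int, int], List[str]],
    threshold: float,
) -> Tuple[List[List[str]], int]:
    """Toy slot-level decode loop (illustrative assumption, not the paper's code)."""
    done: List[Slot] = []                  # finished slots, in generation order
    pending: List[int] = list(range(num_slots))  # positions that are still masked
    iterations = 0
    while pending:
        iterations += 1
        # Inter-slot step: score every masked slot given the finished prefix.
        scores = {pos: score_slot(done, pos) for pos in pending}
        chosen = [pos for pos in pending if scores[pos] >= threshold]
        if not chosen:
            # Always make progress: fall back to the single best slot.
            chosen = [max(pending, key=scores.__getitem__)]
        # Intra-slot step: each chosen slot is filled by the callback,
        # conditioned on the same finished prefix.
        new_slots = [(pos, fill_slot(done, pos, slot_len)) for pos in chosen]
        # Reorganization: new slots go after earlier ones and ahead of the
        # masks, so `done` only ever grows at its end.
        done.extend(new_slots)
        chosen_set = set(chosen)
        pending = [pos for pos in pending if pos not in chosen_set]
    # Restore the original slot order for the final output.
    ordered = [tokens for _, tokens in sorted(done, key=lambda s: s[0])]
    return ordered, iterations


def toy_score(done: List[Slot], pos: int) -> float:
    # Hypothetical confidence: every third slot is "easy".
    return 1.0 / (1 + pos % 3)


def toy_fill(done: List[Slot], pos: int, slot_len: int) -> List[str]:
    # Hypothetical infilling: emit placeholder tokens left to right.
    tokens: List[str] = []
    for i in range(slot_len):
        tokens.append(f"s{pos}t{i}")
    return tokens


output, iterations = refusion_style_decode(6, 4, toy_score, toy_fill, threshold=0.9)
```

With this toy scorer, slots 0 and 3 clear the threshold together in the first iteration, so the six slots finish in five iterations rather than six. A real model would replace both callbacks and would likely select far more slots per step.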
Peer Reviews
Decision: ICLR 2026 Poster
- The authors propose a practical "plan-and-infilling" process that works fairly well, although the basic idea of ReFusion might not be very novel.
- The pilot study in Section 4.1 is very interesting and clearly exhibits how the distance affects the correlation.
- The experiments, including the ablation studies, provide a very comprehensive understanding of how ReFusion works.
- Similar ideas have been discussed in previous works. E.g., BD3-LM (https://arxiv.org/pdf/2503.09573) utilizes block diffusion, and EDLM (https://arxiv.org/abs/2410.21357) utilizes AR models to model the correlations. The authors should clearly state the differences and their unique contribution.
- I find the two-step inference method in Section 4.2 somewhat obscure. I suggest the authors reorganize the description into mathematical equations and add more details (e.g., how to perform positional
1. The slot abstraction plus causal infill provides an intuitive route to exact KV-cache reuse, reducing the efficiency gap with AR decoding.
2. The paper provides comparisons with competitive baselines and shows consistent wins against both LLaDA and Dream on most tasks, supporting generality.
3. The paper probes slot thresholds and provides qualitative evidence that aligns with the design intuition.
1. ReFusion introduces extra data preparation and training cost, adding pipeline complexity to realize its gains.
2. While ReFusion is much faster than prior MDMs, its throughput is not significantly better than that of AR models, and ReFusion may not be orthogonal to existing tricks.
1. The authors challenge a common belief in the current dLLM literature by arguing that we need intra-block autoregressive decoding rather than intra-block parallel decoding. The analysis is interesting, and it is also very interesting to see a common belief challenged.
2. The proposed method is sound on both the training and the inference side.
My primary concern lies in the unfair, insufficient, and potentially incorrect experimental evaluations, which lead to several overstated claims in the paper.
* **Missing Comparison with Block Diffusion** One of the key hypotheses this paper wants to show is that intra-block autoregressive decoding + inter-block parallel decoding is superior to intra-block parallel decoding + inter-block autoregressive decoding. However, no experiments are provided to substantiate this claim. A direct comparis
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Caching and Content Delivery
