ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

Jia-Nan Li; Jian Guan; Wei Wu; Chongxuan Li

arXiv:2512.13586 · cs.CL · March 6, 2026

TL;DR

ReFusion is a novel masked diffusion model that enhances parallel decoding in large language models by integrating sequence reorganization, leading to significant speedups and performance improvements over prior diffusion models while approaching autoregressive model quality.

Contribution

ReFusion introduces a slot-level diffusion approach with sequence reorganization, enabling full KV cache reuse and reducing learning complexity, thus improving speed and performance of language models.
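The link between reorganization and KV cache reuse can be illustrated with a toy sketch. It assumes causal attention, so a cached entry stays valid only while everything physically before it is unchanged. The function names `physical_layouts` and `cache_stays_valid`, and the use of `None` for a masked slot, are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
def physical_layouts(fill_order, num_slots, reorder):
    """Return the physical slot layout after each fill step.

    Each layout lists slot ids in physical order; None marks a masked slot.
    With reorder=True, filled slots are moved ahead of the remaining masks
    (in the order they were generated). With reorder=False, slots are
    unmasked in place at their original positions.
    """
    layouts = []
    filled = []
    for slot in fill_order:
        filled.append(slot)
        if reorder:
            layout = filled + [None] * (num_slots - len(filled))
        else:
            layout = [s if s in filled else None for s in range(num_slots)]
        layouts.append(layout)
    return layouts


def cache_stays_valid(layouts):
    """Check the causal-cache condition between consecutive steps.

    Assumption: under causal attention, the cached entry for a filled slot
    at physical index i depends on layout[:i + 1]. The cache is reusable
    only if that prefix is identical in the next step.
    """
    for before, after in zip(layouts, layouts[1:]):
        for i, slot in enumerate(before):
            if slot is not None and before[:i + 1] != after[:i + 1]:
                return False
    return True


order = [2, 0, 1]  # slots generated out of left-to-right order
print(cache_stays_valid(physical_layouts(order, 3, reorder=True)))
print(cache_stays_valid(physical_layouts(order, 3, reorder=False)))
```

In this toy setting, reordering makes the generated prefix append-only, whereas in-place unmasking changes the context in front of an already filled slot.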

Findings

01. 34% performance gain over prior diffusion models

02. Over 18× average speedup compared to previous MDMs

03. Bridges the performance gap to autoregressive models

Abstract

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that integrates sequence reorganization into the causal attention framework. By elevating parallel decoding from the token level to a higher slot level, ReFusion interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, while reordering newly generated slots ahead of the remaining masks after each iteration. Consequently, this design simultaneously unlocks full KV cache reuse and reduces learning complexity…
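The decoding loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: `toy_slot_confidence` and `toy_fill_slot` are hypothetical stand-ins for the model, the fixed slot size and the threshold rule (with a fallback to the single best slot) are assumptions, and real decoding would run a neural network rather than a random scorer.

```python
import random


def toy_slot_confidence(slot_index, rng):
    # Hypothetical stand-in for the model's per-slot confidence score.
    return rng.random()


def toy_fill_slot(slot_index, slot_size):
    # Hypothetical stand-in for intra-slot autoregressive infilling:
    # tokens within a slot are produced one at a time, left to right.
    tokens = []
    for offset in range(slot_size):
        tokens.append(f"t{slot_index * slot_size + offset}")
    return tokens


def decode(num_slots, slot_size, threshold=0.5, seed=0):
    rng = random.Random(seed)
    generated = []                   # (slot_index, tokens) in generation order
    masked = list(range(num_slots))  # slots still to be filled
    iterations = 0
    while masked:
        iterations += 1
        # Inter-slot selection: pick every masked slot above the threshold,
        # falling back to the single best slot so progress is guaranteed.
        scores = {s: toy_slot_confidence(s, rng) for s in masked}
        chosen = [s for s in masked if scores[s] >= threshold]
        if not chosen:
            chosen = [max(masked, key=scores.get)]
        # Intra-slot infilling for each chosen slot.
        for s in chosen:
            generated.append((s, toy_fill_slot(s, slot_size)))
        # Reorganization: the physical sequence is now `generated` followed
        # by the remaining masks, so earlier slots never move.
        masked = [s for s in masked if s not in chosen]
    # Restore the original order from the remembered slot indices.
    ordered = sorted(generated, key=lambda item: item[0])
    tokens = [tok for _, toks in ordered for tok in toks]
    generation_order = [s for s, _ in generated]
    return tokens, iterations, generation_order


tokens, iterations, generation_order = decode(num_slots=4, slot_size=2)
print(tokens, iterations, generation_order)
```

Several slots may be filled in one iteration, so the number of iterations is at most the number of slots; the final sort by slot index is what lets generation proceed in any slot order.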

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The authors propose a practical "plan-and-infilling" process that works fairly well, although the basic idea of ReFusion might not be very novel.
- The pilot study in Section 4.1 is very interesting and clearly exhibits how the distance affects the correlation.
- The experiments, including the ablation studies, provide a very comprehensive understanding of how ReFusion works.

Weaknesses

- Similar ideas have been discussed in previous works. E.g., BD3-LM (https://arxiv.org/pdf/2503.09573) uses block diffusion, and EDLM (https://arxiv.org/abs/2410.21357) uses AR models to model the correlations. The authors should clearly state the differences and their unique contribution.
- I find the two-step inference method in Section 4.2 somewhat obscure. I suggest the authors reorganize the description into mathematical equations and add more details (e.g., how to perform positional

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The slot abstraction plus causal infill provides an intuitive route to exact KV-cache reuse, reducing the efficiency gap with AR decoding.
2. The paper compares against competitive baselines and shows consistent wins over both LLaDA and Dream on most tasks, supporting generality.
3. The paper probes slot thresholds and provides qualitative evidence that aligns with the design intuition.

Weaknesses

1. ReFusion introduces extra data preparation and training cost, adding pipeline complexity to realize its gains.
2. While ReFusion is much faster than prior MDMs, its throughput is not significantly better than that of AR models, and ReFusion may not be orthogonal to existing tricks.

Reviewer 03 · Rating 4 · Confidence 5

Strengths

1. The authors try to challenge a common belief in the current dLLM literature, arguing that we need to perform intra-block autoregressive decoding instead of intra-block parallel decoding. The analysis is interesting, and it is also interesting to see a common belief challenged.
2. The proposed method is sound on both the training and the inference side.

Weaknesses

My primary concern lies in the unfair, insufficient, and potentially incorrect experimental evaluations, which lead to several overstated claims in the paper.

* **Missing Comparison with Block Diffusion** One of the key hypotheses this paper wants to show is that intra-block autoregressive decoding + inter-block parallel decoding is superior to intra-block parallel decoding + inter-block autoregressive decoding. However, no experiments are provided to substantiate this claim. A direct comparis

Code & Models

Models

GSAI-ML/ReFusion
model · 346 dl · ♡ 13

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Caching and Content Delivery