Auto-Regressive Masked Diffusion Models

Mahdi Karami; Ali Ghodsi

arXiv:2601.16971·cs.LG·January 26, 2026

TL;DR

The paper introduces ARMD, an architecture that combines the training efficiency of autoregressive models with the parallel generation capabilities of masked diffusion models, enabling faster training and inference for language modeling while achieving state-of-the-art results.

Contribution

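ARMD reframes the masked diffusion process as a block-wise causal model, so that the conditional probabilities for multiple denoising steps are computed in a single parallel forward pass, and introduces a new parallel generation strategy for more efficient decoding.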

Findings

01 Achieves state-of-the-art performance on language benchmarks.

02 Requires fewer training steps than previous diffusion models.

03 Enables parallel text generation while maintaining coherence.

Abstract

Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn…
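
To make the block-wise causal view concrete, the following is a minimal sketch of an attention mask in which each position may attend only to positions revealed at strictly earlier denoising steps. It is an illustration under stated assumptions, not the authors' implementation: the function name `block_causal_mask`, the `order` and `block_sizes` arguments, and the choice of NumPy are all hypothetical and do not come from the paper.

```python
import numpy as np


def block_causal_mask(order, block_sizes):
    """Build an (n, n) boolean mask for a block-wise causal model.

    mask[i, j] is True when position i may attend to position j.

    order       : permutation of range(n); order[k] is the position
                  revealed k-th (hypothetical argument, for illustration).
    block_sizes : how many positions are revealed at each denoising step;
                  must sum to n (hypothetical argument, for illustration).
    """
    n = len(order)
    assert sum(block_sizes) == n, "block sizes must cover every position"

    # step[p] = index of the denoising step at which position p is revealed.
    step = np.empty(n, dtype=int)
    step[np.asarray(order)] = np.repeat(np.arange(len(block_sizes)), block_sizes)

    # Strictly causal across blocks: position i sees position j only if
    # j was revealed at an earlier step than i.
    return step[None, :] < step[:, None]


# Four positions revealed in the order 2, then {0, 3}, then 1.
mask = block_causal_mask(order=[2, 0, 3, 1], block_sizes=[1, 2, 1])
print(mask.astype(int))
```

With the identity order and blocks of size one, this mask reduces to the strictly lower-triangular mask of a token-by-token autoregressive model. Under this sketch, positions in the first block attend to nothing, so a practical model would need some always-visible context such as a start token; that detail is also an assumption here.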

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods