Auto-Regressive Masked Diffusion Models
Mahdi Karami, Ali Ghodsi

TL;DR
The paper introduces ARMD, an architecture that combines the training efficiency of autoregressive models with the parallel generation of masked diffusion models, enabling faster training and inference for language modeling while reporting state-of-the-art results.
Contribution
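ARMD reframes the masked diffusion process as a block-wise causal model, so that the conditional probabilities for multiple denoising steps are computed in a single parallel forward pass, and introduces a new parallel generation strategy for more efficient decoding.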
Findings
Achieves state-of-the-art performance on language benchmarks.
Requires fewer training steps than prior masked diffusion models.
Enables parallel text generation with maintained coherence.
Abstract
Masked diffusion models (MDMs) have emerged as a promising approach for language modeling, yet they face a performance gap compared to autoregressive models (ARMs) and require more training iterations. In this work, we present the Auto-Regressive Masked Diffusion (ARMD) model, an architecture designed to close this gap by unifying the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models. Our key insight is to reframe the masked diffusion process as a block-wise causal model. This perspective allows us to design a strictly causal, permutation-equivariant architecture that computes all conditional probabilities across multiple denoising steps in a single, parallel forward pass. The resulting architecture supports efficient, autoregressive-style decoding and a progressive permutation training scheme, allowing the model to learn…
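The block-wise causal view described in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the function name `block_causal_mask`, the `order` and `block_sizes` arguments, and the choice to forbid attention within a block are all assumptions made for illustration. The idea shown is only that, once each position is assigned to the denoising step that reveals it, a single mask can encode what every step is allowed to condition on, which is what makes one parallel forward pass possible.

```python
def block_causal_mask(order, block_sizes):
    """Return mask[i][j] = True when position i may condition on position j.

    order: a permutation of range(n); order[t] is the position revealed t-th.
    block_sizes: number of positions revealed at each denoising step.
    Both arguments are hypothetical names used only for this sketch.
    """
    n = len(order)
    if sorted(order) != list(range(n)):
        raise ValueError("order must be a permutation of range(n)")
    if sum(block_sizes) != n:
        raise ValueError("block sizes must sum to the sequence length")

    # Assign every position the index of the denoising step that reveals it.
    step_of = [0] * n
    t = 0
    for step, size in enumerate(block_sizes):
        for pos in order[t:t + size]:
            step_of[pos] = step
        t += size

    # Assumption: strictly causal across blocks. A position sees only
    # positions revealed in earlier steps, never its own block, so positions
    # revealed in the same step are predicted independently of each other.
    return [[step_of[j] < step_of[i] for j in range(n)] for i in range(n)]


# Four positions revealed in the order 2, then {0, 3}, then 1.
order = [2, 0, 3, 1]
mask = block_causal_mask(order, [1, 2, 1])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

With the identity order and one position per step, the mask reduces to the strictly lower-triangular mask of a standard autoregressive model. Larger blocks correspond to revealing several tokens per denoising step.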
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods
