Simplified and Generalized Masked Diffusion for Discrete Data
Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, Michalis K. Titsias

TL;DR
This paper introduces a simplified, unified framework for masked diffusion models that improves discrete data modeling, achieving state-of-the-art results in language and image generation tasks by leveraging a continuous-time variational objective.
Contribution
The work presents a simple, general framework for masked diffusion models, clarifies their theoretical foundation, and demonstrates superior performance on language and image benchmarks.
Findings
Outperforms prior diffusion language models on perplexity and zero-shot tasks.
Achieves state-of-the-art bits per dimension on CIFAR-10 and ImageNet 64x64.
Provides a unified, theoretically grounded approach to masked diffusion modeling.
Abstract
Masked (or absorbing) diffusion is actively explored as an alternative to autoregressive models for generative modeling of discrete data. However, existing work in this area has been hindered by unnecessarily complex model formulations and unclear relationships between different perspectives, leading to suboptimal parameterization, training objectives, and ad hoc adjustments to counteract these issues. In this work, we aim to provide a simple and general framework that unlocks the full potential of masked diffusion models. We show that the continuous-time variational objective of masked diffusion models is a simple weighted integral of cross-entropy losses. Our framework also enables training generalized masked diffusion models with state-dependent masking schedules. When evaluated by perplexity, our models trained on OpenWebText surpass prior diffusion language models at GPT-2 scale…
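To make the "weighted integral of cross-entropy losses" concrete, here is a minimal sketch of a one-sample Monte Carlo estimate of such an objective. It is not the authors' implementation. It assumes a linear masking schedule alpha(t) = 1 - t and a weight of -alpha'(t) / (1 - alpha(t)) applied to the cross-entropy on masked positions only. The names MASK, alpha, mask_tokens, weighted_ce_loss, and predict_probs are hypothetical, and t_min is an illustrative cutoff that avoids the 1/t singularity at t = 0.

```python
import math
import random

MASK = -1  # hypothetical id for the mask (absorbing) token


def alpha(t):
    """Assumed linear schedule: probability a token is still unmasked at time t."""
    return 1.0 - t


def alpha_prime(t):
    """Time derivative of the assumed linear schedule."""
    return -1.0


def mask_tokens(x0, t, rng):
    """Forward process: independently replace each token by MASK with prob 1 - alpha(t)."""
    return [tok if rng.random() < alpha(t) else MASK for tok in x0]


def weighted_ce_loss(x0, predict_probs, rng, t_min=1e-3):
    """One-sample estimate of the integral over [t_min, 1] of weight(t) * cross-entropy.

    predict_probs(xt, t) is a hypothetical model callable that returns, for each
    position, a list of probabilities over the vocabulary.
    """
    t = rng.uniform(t_min, 1.0)
    xt = mask_tokens(x0, t, rng)
    probs = predict_probs(xt, t)
    # Cross-entropy is accumulated on masked positions only.
    ce = 0.0
    for pos, tok in enumerate(x0):
        if xt[pos] == MASK:
            ce += -math.log(probs[pos][tok])
    weight = -alpha_prime(t) / (1.0 - alpha(t))  # equals 1 / t for the linear schedule
    # (1 - t_min) undoes the density of the uniform sampler over [t_min, 1].
    return (1.0 - t_min) * weight * ce


def uniform_predictor(vocab_size):
    """Toy stand-in for a network: predicts a uniform distribution everywhere."""
    def predict(xt, t):
        return [[1.0 / vocab_size] * vocab_size for _ in xt]
    return predict


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [3, 1, 4, 1, 5, 2, 6, 5]
    print(weighted_ce_loss(tokens, uniform_predictor(8), rng))
```

In practice the toy predictor would be replaced by a neural network, and the estimate would be averaged over a batch of sequences and sampled times.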
Taxonomy
Topics: Image and Signal Denoising Methods
Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Weight Decay · Linear Warmup With Cosine Annealing · Linear Layer · Byte Pair Encoding · Adam · Attention Dropout
