Scaling up Masked Diffusion Models on Text
Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li

TL;DR
This paper introduces scaling laws for Masked Diffusion Models (MDMs) in language tasks, demonstrating their competitive performance and efficiency compared to autoregressive models, especially at large scales.
Contribution
It establishes the first scaling law for MDMs, trains large MDMs up to 1.1B parameters, and proposes an unsupervised guidance method to enhance conditional inference.
Findings
MDMs scale similarly to autoregressive models with a small compute gap.
A 1.1B MDM outperforms a similarly sized TinyLlama in zero-shot benchmarks.
MDMs with 16x pre-training time match ARMs in performance and are 1.4x faster during sampling.
Abstract
Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, the 1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data across…
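The guidance idea mentioned in the abstract can be illustrated with the generic classifier-free guidance rule applied to a single masked position. This is a minimal sketch, not the authors' implementation: the function names, the guidance weight `w`, and the example logits are assumptions made for illustration, and this summary does not describe how the paper's unsupervised variant obtains the unconditional prediction.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def guided_log_probs(cond_logits, uncond_logits, w):
    """Generic classifier-free guidance for one token position.

    Computes p_guided proportional to p_cond^(1 + w) / p_uncond^w in log
    space and renormalizes. `w` is a hypothetical guidance weight;
    w = 0 recovers the conditional distribution unchanged.
    """
    log_c = log_softmax(cond_logits)
    log_u = log_softmax(uncond_logits)
    mixed = [(1.0 + w) * c - w * u for c, u in zip(log_c, log_u)]
    return log_softmax(mixed)


# Illustrative values only: the conditional model prefers token 0,
# while the unconditional model is indifferent.
cond = [2.0, 0.0, 0.0]
uncond = [0.0, 0.0, 0.0]
guided = guided_log_probs(cond, uncond, w=1.0)
print([round(math.exp(lp), 3) for lp in guided])
```

With a positive weight, tokens that the conditional prediction favors more strongly than the unconditional one gain probability mass, which is the general mechanism by which guidance sharpens conditional inference.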
Peer Reviews
Decision · ICLR 2025 Poster
- Exploring diffusion models for text is an interesting direction. The authors' work is well motivated. - The gains from unsupervised CFG seem consistent across the 8 tasks.
- I believe comparing to a text-based LLM that uses bidirectional masking/span corruption would be fairer than comparing to a left-to-right next-token-prediction ARM like GPT-2. This would disentangle the value of masking (which can also be done in language models) from the diffusion aspect. Right now I think these two factors are conflated. For example, are the reversal curse results in Table 6 due to the MDM's masking or to its being a diffusion model? - The novelty of the work is also modest.
1. The paper scales up the Masked Diffusion Model (MDM) for text, and the experiments are solid, demonstrating both its strength (bidirectional modeling of text) and its weakness (it has to consume more computation to reach the performance of an AR language model). 2. The paper provides the research community with an understanding of MDMs in terms of scaling and of future research directions toward more computationally efficient MDMs, which is a significant contribution. 3. The paper is written in a clear an
1. The main theoretical basis is given in previous work [1]; this work only performs experiments to scale up, which makes the paper less impressive and novel. 2. In the reversal curse experiment (Table 6), MDM performance still seems to decay in the reverse direction on two datasets. Given that the model encodes text bidirectionally, how can this decay be explained? 3. There are minor typos. [1] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributio
- The introduction of a scaling law for MDMs trained from scratch suggests that they show the same scaling potential as ARMs. - The presentation is clear and well-organized, with only minor typos. - The results of MDMs on reversal curse tasks are strikingly better than those of ARMs. This is an interesting point.
Unfair Comparison: Lines 368-369: The authors extend MDM pre-training time by a factor of 16 to achieve a “meaningful” comparison with ARMs. However, a more valid comparison would involve MDMs and ARMs trained with equivalent compute budgets or trained to convergence for a given model size. Comparing an MDM to an ARM trained with only 1/16 of its FLOPs does not yield a balanced assessment. Missing Citations: The paper lacks a related work section, making it difficult for reader
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling
Methods: Linear Layer · Dropout · Byte Pair Encoding · Layer Normalization · Residual Connection · Cosine Annealing · Attention Is All You Need
