Scaling Beyond Masked Diffusion Language Models

Subham Sekhar Sahoo; Jean-Marie Lemercier; Zhihan Yang; Justin Deschenaux; Jingyu Liu; John Thickstun; Ante Jukic

arXiv:2602.15014·cs.LG·February 17, 2026




TL;DR

This paper investigates the scaling behavior of discrete diffusion language models, finding that perplexity alone is insufficient for comparing diffusion families and that alternatives such as uniform-state diffusion can outperform masked diffusion on practical tasks.

Contribution

It presents the first scaling law analysis of uniform-state and interpolating discrete diffusion methods and shows that masked diffusion can be made approximately 12% more FLOPs-efficient by training with a simple cross-entropy objective.

Findings

01 Perplexity is informative within a diffusion family but can be misleading across families.

02 Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective.

03 Uniform-state diffusion remains competitive on likelihood benchmarks and excels in practical tasks like GSM8K.

Abstract

Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language…
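The speed-quality Pareto frontier mentioned in the abstract can be illustrated with a minimal sketch. It assumes each model configuration is summarized by a sampling cost and a quality loss, both lower-is-better. The function name `pareto_frontier`, the run labels, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# (label, sampling_cost, quality_loss); lower is better on both axes.
Point = Tuple[str, float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both cost and loss."""
    # Visit points from cheapest to most expensive; ties go to the lower loss.
    ordered = sorted(points, key=lambda p: (p[1], p[2]))
    frontier: List[Point] = []
    best_loss = float("inf")
    for label, cost, loss in ordered:
        # A point stays only if it improves on every cheaper point's loss.
        if loss < best_loss:
            frontier.append((label, cost, loss))
            best_loss = loss
    return frontier


# Made-up numbers for illustration only (not results from the paper).
example_runs: List[Point] = [
    ("run_a_fast", 1.0, 30.0),
    ("run_a_slow", 4.0, 22.0),
    ("run_b_fast", 1.5, 26.0),
    ("run_b_slow", 4.0, 24.0),
    ("run_c", 2.0, 28.0),
]

for label, cost, loss in pareto_frontier(example_runs):
    print(f"{label}: cost={cost}, loss={loss}")
```

In this toy data, `run_c` and `run_b_slow` fall off the frontier because another run is at least as cheap and has lower loss. The same reasoning explains why a model family with worse likelihood could still be preferred: what matters on the frontier is quality at a given sampling cost, not likelihood alone.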


Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Generative Adversarial Networks and Image Synthesis