Scaling Behavior of Discrete Diffusion Language Models
Dimitri von Rütte, Janis Fluri, Omead Pooladzandi, Bernhard Schölkopf, Thomas Hofmann, Antonio Orvieto

TL;DR
This paper investigates the scaling laws of discrete diffusion language models, revealing how different noise types affect their efficiency and demonstrating the successful training of a 10-billion-parameter uniform diffusion model.
Contribution
It provides the first comprehensive analysis of DLM scaling behavior across noise types and introduces a large-scale uniform diffusion model, highlighting its potential advantages.
Findings
Scaling behavior varies significantly with noise type.
Uniform diffusion models are more data-efficient in compute-bound regimes.
The authors trained the largest known uniform diffusion model, with 10B parameters.
Abstract
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we…
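To make the two noise types concrete, here is a minimal sketch of token corruption that mixes masking with uniform replacement. It is an illustration only, not the paper's formulation: the function name `corrupt_tokens`, the fixed mixing weight `uniform_frac`, and the choice of placing the mask token just outside the vocabulary are all assumptions made for this example.

```python
import numpy as np

def corrupt_tokens(tokens, t, uniform_frac, vocab_size, mask_id, rng):
    """Corrupt a token sequence with a mix of masking and uniform noise.

    Hypothetical helper for illustration (not from the paper).
    t            : noise level in [0, 1]; chance that a token is replaced.
    uniform_frac : 0.0 gives pure masked noise, 1.0 gives pure uniform noise.
    mask_id      : id of the mask token (assumed to lie outside the vocabulary).
    """
    tokens = np.asarray(tokens)
    # Decide independently for each position whether it gets corrupted.
    replace = rng.random(tokens.shape) < t
    # For corrupted positions, choose between a random token and the mask.
    use_uniform = rng.random(tokens.shape) < uniform_frac
    random_tokens = rng.integers(0, vocab_size, size=tokens.shape)
    noise = np.where(use_uniform, random_tokens, mask_id)
    return np.where(replace, noise, tokens)

rng = np.random.default_rng(0)
clean = np.array([5, 17, 42, 8, 99, 3])
masked = corrupt_tokens(clean, t=0.5, uniform_frac=0.0, vocab_size=100, mask_id=100, rng=rng)
uniform = corrupt_tokens(clean, t=0.5, uniform_frac=1.0, vocab_size=100, mask_id=100, rng=rng)
hybrid = corrupt_tokens(clean, t=0.5, uniform_frac=0.5, vocab_size=100, mask_id=100, rng=rng)
```

With masked noise the model can see which positions were corrupted, whereas with uniform noise a corrupted token looks like any other token, so the denoiser must also decide what to change. Intermediate values of `uniform_frac` sit between these two regimes.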
Peer Reviews
Decision·ICLR 2026 Poster
* This work studies an important topic that is of interest to the community working on discrete diffusion models for language. The methodology taken is generally sound except for a few design choices explained below.
* Some experiment design choices are justified through ablations, e.g., the omission of the annealing phase to save a search dimension.
* The conclusion drawn from the fitted scaling law is surprising and, if verified, could lead to a paradigm shift in the discrete diffusion community. It
* The writing needs significant improvement: the paper currently lacks a proper introduction of background material, and just citing it is not sufficient. For example, in “To support both isotropic and anisotropic denoising, we implement diffusion forcing (Chen et al., 2024) by sampling independent per-token noise levels for 50% of samples”, the authors need to define “isotropic” and “anisotropic” denoising and introduce the diffusion forcing method. It is not immediately clear why su
This work is well presented and covers a reasonable range of noise schedules, model sizes, and hyperparameters. Scaling models are quite important for guiding future work, and this work is executed well enough to be generally impactful and useful. While I have some concerns about evaluations I believe are missing, this work could be a timely and useful addition to the community if these shortcomings are addressed.
This work is mainly limited by presenting results only in terms of the ELBO and not performing any downstream evaluation. This limits any comparison to ALMs and allows for possible confounding, since different mixing distributions during training may have different performance characteristics in downstream evaluation. The GIDD work, which this work cites frequently, provides a reasonable setup for downstream evaluation that would benefit this work immensely. In particular, one thing I would
It is an insightful idea to unify the formulations of uniform and masked diffusion models by reparameterizing the signal-to-noise ratio (SNR), which could help in exploring new types of diffusion models in future work.
1. The authors sometimes make very strong claims without proper scientific support. For example, in Line#108: “... showing that training with and without annealing yields similar optima and a similar loss, up to some constant factor”. However, it turns out that the authors only compared the shapes of the loss curves of two different models through visual inspection, without any statistical analysis (Line#361 and 362).
2. The analysis of optimal batch size and optimal step count is confusing. In
Taxonomy
Topics: Language and cultural evolution · Authorship Attribution and Profiling · Topic Modeling
