Support Before Frequency in Discrete Diffusion
Adrian Müller, Antoine Gonon, Zebang Shen, Ya-Ping Hsieh, Niao He

TL;DR
This paper shows that discrete diffusion models learn the support of the data (which sequences are valid) before the finer frequencies of sequences within that support, and traces this hierarchy to the structure of the exact reverse process.
Contribution
For uniform and absorbing (masking) diffusion, it proves that in the small-noise regime each single-token reverse edit decomposes into a leading scale and a finer coefficient, which yields a support-before-frequency hierarchy in discrete diffusion models.
Findings
Support structure is learned before frequency details.
Uniform and absorbing diffusion exhibit different validity-improving behaviors.
Experiments confirm the predicted hierarchy and mechanism-dependent differences.
Abstract
Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The…
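The scale/coefficient split described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's construction: it assumes a hypothetical 3-symbol vocabulary, length-2 sequences, a two-point data distribution `P0`, and a simple per-token uniform corruption with noise level `eps`. It also assumes that single-token reverse edits are weighted by the ratio of noisy marginals `p_eps(x_new) / p_eps(x)`, which is a common way to write reverse rates in discrete diffusion. Under these assumptions, the ratio behaves like `eps` raised to the change in Hamming distance to the support (the scale), multiplied by a factor that depends on the data frequencies (the coefficient).

```python
import math

# All names and values below are hypothetical and chosen only for illustration.
VOCAB = (0, 1, 2)                      # toy vocabulary
LENGTH = 2                             # toy sequence length
P0 = {(0, 0): 0.75, (1, 1): 0.25}      # toy data distribution; its keys are the support


def hamming(a, b):
    """Number of positions where two sequences differ."""
    return sum(u != v for u, v in zip(a, b))


def dist_to_support(x):
    """Hamming distance from x to the nearest supported sequence."""
    return min(hamming(x, y) for y in P0)


def noisy_marginal(x, eps):
    """p_eps(x) under an assumed corruption: each token is kept with
    probability 1 - eps, otherwise replaced by one of the other
    len(VOCAB) - 1 symbols chosen uniformly."""
    flip = eps / (len(VOCAB) - 1)
    return sum(
        w * (1 - eps) ** (LENGTH - hamming(x, y)) * flip ** hamming(x, y)
        for y, w in P0.items()
    )


def edit_ratio(x, x_new, eps):
    """Ratio p_eps(x_new) / p_eps(x) for a single-token edit x -> x_new."""
    return noisy_marginal(x_new, eps) / noisy_marginal(x, eps)


def scale_exponent(x, x_new, eps):
    """Estimate k such that the ratio behaves like eps**k (rounded)."""
    return round(math.log(edit_ratio(x, x_new, eps)) / math.log(eps))


if __name__ == "__main__":
    eps = 1e-3
    edits = [
        ((0, 1), (0, 0)),  # moves onto the support
        ((0, 1), (1, 1)),  # moves onto the support, rarer target
        ((0, 1), (0, 2)),  # stays at distance 1 from the support
        ((0, 0), (0, 1)),  # moves away from the support
    ]
    for x, x_new in edits:
        change = dist_to_support(x_new) - dist_to_support(x)
        print(x, "->", x_new,
              "distance change:", change,
              "scale exponent:", scale_exponent(x, x_new, eps),
              "ratio:", edit_ratio(x, x_new, eps))
```

In this toy setting, the two edits that land on the support share the same scale (order `1/eps`), so getting the order of magnitude right is enough to prefer valid sequences. Telling `(0, 0)` from `(1, 1)` requires the coefficient: the quotient of their two ratios approaches `0.75 / 0.25 = 3` as `eps` goes to 0.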
