TL;DR
This paper analyzes the high training variance in Masked Diffusion Models (MDMs), decomposes its sources, and proposes six variance-reduction techniques that significantly improve stability and accuracy.
Contribution
It provides the first theoretical decomposition of MDM training variance and introduces two core variance-reduction methods, advancing stable training of MDMs.
Findings
Improved accuracy by 7-8% on complex reasoning tasks.
Reduced run-to-run variability to near ARM levels.
Narrowed the performance gap between MDMs and ARMs.
Abstract
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even pretrained MDMs and ARMs that are equally competitive at initialization often diverge after task-specific training, with MDMs falling far behind. So far there has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, whereas ARMs are affected only by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often…
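To make the three-source decomposition concrete, here is a minimal sketch, assuming the split follows the law of total variance applied in the order pattern, rate, data. The abstract does not give the exact formula, so the nesting order, the toy loss in `toy_losses`, and every function and parameter name below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np


def decompose_variance(losses):
    """Split the variance of a balanced loss array of shape (X, T, M).

    Axis 0 indexes data examples x, axis 1 masking rates t, and
    axis 2 masking patterns m drawn for that (x, t) pair.
    Returns (pattern, rate, data, total) using population variances.
    """
    losses = np.asarray(losses, dtype=float)
    # (A) masking-pattern noise: spread over m, averaged over (x, t).
    pattern = losses.var(axis=2).mean()
    # (B) masking-rate noise: spread over t of the pattern-averaged loss.
    mean_over_m = losses.mean(axis=2)
    rate = mean_over_m.var(axis=1).mean()
    # (C) data noise: spread over x of the fully averaged loss.
    data = mean_over_m.mean(axis=1).var()
    return pattern, rate, data, losses.var()


def toy_losses(num_examples=64, num_rates=32, num_patterns=128,
               seq_len=16, seed=0):
    """Hypothetical stand-in for an MDM loss; not the paper's objective."""
    rng = np.random.default_rng(seed)
    # Per-example difficulty drives the data noise.
    difficulty = rng.gamma(shape=2.0, scale=1.0, size=(num_examples, 1, 1))
    # Each column uses one randomly drawn masking rate t.
    t = rng.uniform(0.05, 1.0, size=(1, num_rates, 1))
    # The number of masked tokens varies across patterns at a fixed t.
    masked = rng.binomial(seq_len, t,
                          size=(num_examples, num_rates, num_patterns))
    # Assumed toy loss: harder at high t, with a 1/t reweighting.
    return difficulty * (1.0 + t) * masked / (seq_len * t)


if __name__ == "__main__":
    pattern, rate, data, total = decompose_variance(toy_losses())
    print(f"pattern={pattern:.4f} rate={rate:.4f} data={data:.4f}")
    print(f"sum={pattern + rate + data:.4f} total={total:.4f}")
```

On a balanced array like this one, the three terms add up to the total variance. A loss with no masking step, as in the ARM case the abstract describes, would leave only the data term.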
Peer Reviews
Decision: ICLR 2026 Poster
1. Strong theoretical foundation. The paper provides a clear and principled variance decomposition for masked diffusion model (MDM) training, unifying prior ad-hoc stabilization methods under a single theoretical framework. It then builds directly on this foundation by proposing six targeted variance-reduction techniques to mitigate the identified sources of instability.
2. Comprehensive empirical validation. The experiments cover both language and multimodal domains, demonstrating that the prop
1. Narrow comparison to ARMs. The study includes only two autoregressive baselines from the same family. Incorporating additional ARM baselines, especially models with different architectures or training paradigms, would help clarify whether the observed variance gap is a general phenomenon or specific to the chosen comparison set.
2. Limited model diversity and scaling analysis. While the empirical results are solid, they are restricted to a single MDM backbone (LLaDA-8B-Instruct). Evaluating t
Overall, I think the paper studies an important problem. The proposed fixes, P-POTS and MIRROR, are clearly argued and have demonstrated practical usefulness in terms of accuracy and training stability.
1. The loss-variance decomposition is insightful, but I believe that for training stability it would be more insightful to analyze gradient variance. It would be nice to see how reductions in the proposed loss variances translate to reduced gradient variances and more stable optimization.
2. MIRROR roughly doubles the cost on some benchmarks compared to the baselines, which is quite expensive. Would MIRROR still be the best choice under a fixed time budget, which is a more practical scenario?
3. I am
- The intuitive and sharp derivation of the theorem provides a mathematically elegant and practical explanation.
- The numerical pre-experiments seem robust and adhere to the expected Pareto frontier.
- Limited experiments on generalization and comparison. The ablation experiments are mixed into the comparison table. The included MDM baselines are too limited.
- Some error bars are missing, and the error bars that are reported are too large to convince readers that the methods perform consistently well in the main table.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Model Reduction and Neural Networks
