Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen

TL;DR
This paper introduces Progressive Unmasking (PUMA), a modification to the forward masking process of Masked Diffusion Models that aligns training-time masks with inference-time masks, speeding up pretraining by approximately 2.5 times at the 125M scale.
Contribution
PUMA is a simple masking strategy that reduces training complexity and aligns training with inference patterns in Masked Diffusion Models.
Findings
PUMA speeds up pretraining by approximately 2.5 times at the 125M scale.
PUMA improves training efficiency when combined with autoregressive initialization.
PUMA reduces the mismatch between training and inference masks.
Abstract
Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train-test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by…
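To make the "exponentially large set of masking patterns" concrete, here is a minimal sketch of the random-mask baseline the abstract describes, not of PUMA itself, whose forward process is not specified in this summary. It assumes the common MDM setup in which a mask rate is drawn uniformly and each position is masked independently; the names `random_mask`, `count_mask_patterns`, and the `[MASK]` token are hypothetical and chosen only for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def random_mask(tokens, rng):
    """Baseline random masking (assumed setup, not PUMA):
    draw a mask rate t ~ U(0, 1), then mask each position
    independently with probability t."""
    t = rng.random()
    masked = [MASK if rng.random() < t else tok for tok in tokens]
    return masked, t


def count_mask_patterns(seq_len):
    """Each position is either masked or visible, so a sequence
    of length seq_len admits 2 ** seq_len mask patterns."""
    return 2 ** seq_len


rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
masked, t = random_mask(tokens, rng)
print(masked, t)
print(count_mask_patterns(len(tokens)))  # 64
```

Even at six tokens there are 64 possible patterns, and the count doubles with every added position, which is why training uniformly over all of them spends effort on masks that a structured inference-time unmasking order may rarely produce.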
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
