Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity
Jianhao Huang, Baharan Mirzasoleiman

TL;DR
This paper investigates how masked diffusion language models generalize, revealing that their implicit regularization can be tuned to improve performance and prevent grokking, especially on large-scale models.
Contribution
The work provides a theoretical decomposition of the MD objective into signal and noise regimes and demonstrates how tuning the mask distribution enhances generalization and scalability.
Findings
The MD objective enables rapid generalization without grokking.
Optimizing the mask probability improves perplexity significantly.
The method scales effectively to large models, outperforming baselines.
Abstract
Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain understudied compared to their auto-regressive counterparts. In this work, we investigate these properties within the setting of the $k$-parity problem (computing the XOR sum of $k$ relevant bits), where neural networks typically exhibit grokking -- a prolonged plateau of chance-level performance followed by sudden generalization. We theoretically decompose the Masked Diffusion (MD) objective into a Signal regime, which drives feature learning, and a Noise regime, which serves as an implicit regularizer. By training nanoGPT with the MD objective on the $k$-parity problem, we demonstrate that the MD objective fundamentally alters the learning landscape, enabling rapid and simultaneous generalization without experiencing grokking. Furthermore, we leverage our…
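To make the setting in the abstract concrete, the following is a minimal sketch of a $k$-parity example with token-level masking applied to it. It is not the paper's data pipeline: the sequence layout (input bits followed by the label), the `MASK` token id, the independent per-token masking rule, and the names `make_parity_example` and `mask_sequence` are all assumptions chosen for illustration.

```python
import random

MASK = 2  # hypothetical mask token id; bits use token ids 0 and 1


def make_parity_example(n_bits, relevant, rng):
    """Sample n_bits uniform bits and append the XOR of the relevant positions."""
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    label = 0
    for i in relevant:
        label ^= bits[i]
    return bits + [label]


def mask_sequence(seq, mask_prob, rng):
    """Replace each token with MASK independently with probability mask_prob.

    Returns the corrupted sequence and the list of masked positions,
    i.e. the positions a masked-reconstruction loss would be computed on.
    """
    corrupted, masked_positions = [], []
    for pos, tok in enumerate(seq):
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            masked_positions.append(pos)
        else:
            corrupted.append(tok)
    return corrupted, masked_positions


rng = random.Random(0)
relevant = [0, 3, 5]  # k = 3 relevant bits (illustrative choice)
seq = make_parity_example(n_bits=8, relevant=relevant, rng=rng)
corrupted, masked = mask_sequence(seq, mask_prob=0.5, rng=rng)
```

In this sketch `mask_prob` is the single knob standing in for the mask probability that the summary describes as tunable; how the paper actually parameterizes or samples it is not stated in the text above.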
