Variational Masked Diffusion Models

Yichi Zhang; Alex Schwing; Zhizhen Zhao

arXiv:2510.23606·cs.LG·October 28, 2025

TL;DR

Variational Masked Diffusion (VMD) introduces latent variables into masked diffusion models to better capture token dependencies, improving generation quality and consistency across synthetic, puzzle, and text datasets.

Contribution

VMD is the first framework to explicitly model token dependencies in masked diffusion models using variational inference.

Findings

01

VMD learns dependencies that standard masked diffusion cannot capture.

02

VMD improves global consistency in Sudoku and text datasets.

03

VMD enhances generation quality by modeling token dependencies.

Abstract

Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation…
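The limitation described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' VMD implementation: it assumes a hypothetical two-token dataset in which only the sequences (0, 0) and (1, 1) occur, and it uses a hand-set (not learned) binary latent variable `z`. All names (`DATA`, `PRIOR`, `CONDITIONALS`, `factorized_joint`, `latent_mixture_joint`, `valid_mass`) are made up for illustration. It compares how much probability mass lands on valid sequences when both masked tokens are predicted concurrently from independent per-position marginals versus when the per-position predictions are conditioned on a shared latent.

```python
from itertools import product

# Hypothetical toy data: two tokens that must always agree.
DATA = {(0, 0): 0.5, (1, 1): 0.5}

# Hand-set latent model (an assumption, not learned): z selects the shared value.
PRIOR = {0: 0.5, 1: 0.5}
CONDITIONALS = {
    0: [{0: 1.0, 1: 0.0}, {0: 1.0, 1: 0.0}],  # p(x_i | z = 0) for each position i
    1: [{0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0}],  # p(x_i | z = 1) for each position i
}


def factorized_joint(data):
    """Best fully factorized fit: the product of per-position marginals."""
    n = len(next(iter(data)))
    vocab = sorted({tok for seq in data for tok in seq})
    marginals = [
        {v: sum(p for seq, p in data.items() if seq[i] == v) for v in vocab}
        for i in range(n)
    ]
    joint = {}
    for seq in product(vocab, repeat=n):
        prob = 1.0
        for i, tok in enumerate(seq):
            prob *= marginals[i][tok]
        joint[seq] = prob
    return joint


def latent_mixture_joint(prior, conditionals):
    """Joint p(x) = sum_z p(z) * prod_i p(x_i | z)."""
    joint = {}
    for z, pz in prior.items():
        per_pos = conditionals[z]
        for seq in product(*[sorted(d) for d in per_pos]):
            prob = pz
            for i, tok in enumerate(seq):
                prob *= per_pos[i][tok]
            joint[seq] = joint.get(seq, 0.0) + prob
    return joint


def valid_mass(joint, data):
    """Total probability assigned to sequences that occur in the data."""
    return sum(p for seq, p in joint.items() if seq in data)


if __name__ == "__main__":
    print(valid_mass(factorized_joint(DATA), DATA))                     # 0.5
    print(valid_mass(latent_mixture_joint(PRIOR, CONDITIONALS), DATA))  # 1.0
```

Under these assumptions, the factorized predictor places half of its mass on the invalid pairs (0, 1) and (1, 0), while the latent-conditioned predictor places all of its mass on valid pairs even though the tokens are still independent given `z`.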

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The presentation of this paper is clear, and it is easy for readers to understand the whole methodology.
- The extension to block diffusion and the remasking scheme is straightforward but meaningful.
- The experimental design in Sections 4.1 and 4.2 clearly demonstrates VMD's ability to model dependencies, and can also be viewed as a contribution.

Weaknesses

- The methodology of VMD is almost identical to the VADD model (https://arxiv.org/abs/2505.17384). As far as I know, VADD is the first work to consider using a latent variable model to define the transition probability $p_\theta(x_0|x_t)$, using a VAE framework for training, and discussing the related sampling framework. Specifically, (a) model definition. Equation (3) in VMD is similar to equation (6) in VADD, (b) training objective. Equation (5) in VMD is similar to equation (9

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper studies an interesting problem and presents it well with a nice exposition. The topic is also important as it might reduce inference time while maintaining quality outputs. I think this work, either by itself or through its follow-ups, can impact real-world large language models.

Weaknesses

I did not find major weaknesses in this work. Some minor questions are given below. Additional minor questions/comments on writing are provided under "questions." **W1. Number of tokens per block**: Since VMDs have key advantages on sequences with high inter-token dependency, why are the experiments limited to at most 2 tokens per block? Can VMDs work with multiple tokens with strong dependencies between tokens that are located far from each other? Maybe on a needle-in-a-haystack type of datase

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The idea of tackling token dependency through a latent variable seems correct, as it provides information about the posterior distribution that the partially masked sequence alone cannot give. Also, the ELBO objective is theoretically grounded, resulting in a reasonable training loss. Moreover, the experiments are well designed to show that VMD indeed captures token dependency much better.

Weaknesses

The experimental claims in this paper appear weak. 1. In the synthetic dataset, the experiment is well-designed and 2. In the text data, although I appreciate the authors' effort in pretraining VMD from scratch, the small difference in generative perplexity (Table 5) isn't enough to tell that VMD is much better than MDM at capturing token dependency. 3. In the Sudoku puzzle, VMD's accuracy is also only marginally better than the baseline, e.g., Top Prob margin. Given that it's a small-sca
