Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin

TL;DR
This paper introduces Denoising Entropy, a new metric for quantifying uncertainty in Masked Diffusion Models, and proposes algorithms that optimize decoding paths to enhance generation quality across various challenging benchmarks.
Contribution
The paper formalizes the impact of decoding order variability in MDMs, introduces Denoising Entropy as a quantifiable uncertainty measure, and develops algorithms that leverage this metric for improved generation.
Findings
Entropy-guided methods improve generation quality
Significant accuracy boosts on reasoning, planning, and code benchmarks
Denoising Entropy effectively guides decoding path optimization
Abstract
Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for…
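The abstract does not give the formula for Denoising Entropy, so the following is only an illustrative sketch of the post-hoc selection idea, not the authors' implementation. It assumes that a path's uncertainty can be approximated by averaging the Shannon entropy of the model's predictive distributions at still-masked positions over the decoding steps, and that the candidate with the lowest score is kept. The function names (`token_entropy`, `path_entropy`, `select_lowest_entropy`) and the toy probability arrays are hypothetical.

```python
import numpy as np


def token_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (positions, vocab) array."""
    p = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)


def path_entropy(step_probs):
    """Assumed path score: mean over steps of the mean entropy at masked positions.

    step_probs is a list with one (num_masked_positions, vocab) array per step.
    """
    per_step = [token_entropy(p).mean() for p in step_probs if len(p) > 0]
    return float(np.mean(per_step))


def select_lowest_entropy(candidates):
    """Post-hoc selection: return (index, scores) of the lowest-entropy path."""
    scores = [path_entropy(c) for c in candidates]
    return int(np.argmin(scores)), scores


# Toy data (made up): two decoding paths over a 4-token vocabulary.
confident = [
    np.array([[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]),
    np.array([[0.94, 0.02, 0.02, 0.02]]),
]
uncertain = [np.full((2, 4), 0.25), np.full((1, 4), 0.25)]

best, scores = select_lowest_entropy([confident, uncertain])
```

In this toy example the uniform path scores log 4 nats at every step, so the selection picks the first, more peaked path. A real-time guidance variant would apply a similar score during decoding rather than after it; the details of that procedure are not given in the text above.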
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper is fairly well-written with an abundance of figures to aid with the communication of ideas.
- The authors show that their approaches, E-BoN and E-SMC, manage to increase the average accuracy when paired with any suite of decoding approaches.
- The authors provide a theoretical justification for their newly proposed decoding approach.
- The experiments cover a range of tasks and models.
- The authors use GPT2 to evaluate the perplexity of generations. I expected to see the perplexity reported using a much more capable LM given the current landscape.
- As the authors might know, in LLMs, greedy decoding and beam search, which by definition attempt to approximate the lowest-uncertainty generation path, tend to perform quite poorly and are typically avoided as decoding approaches. I would've expected a discussion of how one is to consolidate these well established finding in th
- The paper studies an important problem, optimizing the decoding path in Masked Diffusion Models, and introduces a relatively simple method that scales inference-time computation to achieve notable performance gains.
- The paper also provides extensive experiments evaluating the proposed method across a variety of tasks. It is also nice that the paper attempts to offer a theoretical justification for the proposed approach (though, as noted below, there are some issues with it).
- Propositions 1 and 2 rely on several assumptions that require further justification. In particular, both propositions appear to assume $p_{\theta}(x_0^{\ell} | z_t) = q(x_0^{\ell} | z_t)$. This is a strong assumption in the context of optimizing decoding paths for MDMs. If the learned posterior equals the true data posterior, then all decoding paths would yield the same distribution (see page 7 in [1] for an explanation). This makes the assumption a bit unrealistic, and the practical implicati
1. **Elegant and Principled Framework.** The formulation of *path uncertainty* and *denoising entropy* is mathematically clean and conceptually satisfying. It connects diffusion decoding with information-theoretic measures in a natural way.
2. **Strong Theoretical Justification.** The proofs that hDE bounds joint entropy and approximates per-token loss are technically sound, providing a rare formal underpinning for an uncertainty metric in MDMs.
3. **Simple but Effective Algorithms.** E-BON and
**Empirical Scope and Novelty of Gains.** While the method consistently improves over baselines, the absolute improvements (typically 1–2% accuracy gains on large reasoning models) may be modest given the added complexity.
**Limited Computational Analysis.** The results primarily focus on accuracy or perplexity; runtime or budget trade-offs for E-SMC versus simpler strategies are not quantified, though SMC is known to be resource-intensive.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning · Topic Modeling
