TL;DR
The paper introduces Predictor-Corrector (PC) samplers for discrete diffusion models that outperform ancestral sampling on language and image tasks, with improved quality and efficiency, challenging the assumed dominance of Masked diffusion.
Contribution
It develops a family of PC samplers applicable to arbitrary noise processes, enhancing sampling quality and efficiency in discrete diffusion models.
Findings
PC samplers outperform ancestral sampling on language and image benchmarks.
Sampling quality continues to improve with more steps using PC methods.
Memory-efficient curriculum reduces training time by 25% and memory by 33%.
Abstract
Uniform-state discrete diffusion models excel at few-step generation and guidance due to their ability to self-correct, making them preferred over autoregressive or Masked diffusion models in these settings. However, their sampling quality plateaus with ancestral samplers as the number of steps increases. We introduce a family of Predictor-Corrector (PC) samplers for discrete diffusion that generalize prior methods and apply to arbitrary noise processes. When paired with uniform-state diffusion, our samplers outperform ancestral sampling on both language and image modeling, achieving lower generative perplexity at matched unigram entropy on OpenWebText and better FID/IS scores on CIFAR10. Crucially, unlike conventional samplers, our PC methods continue to improve with more sampling steps. Taken together, these findings call into question the assumption that Masked diffusion is the…
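For intuition about the predictor-corrector pattern the abstract describes, here is a minimal NumPy sketch on a toy problem. It is an illustration under stated assumptions, not the paper's sampler: the linear schedule `alpha`, the i.i.d. token distribution `p_data`, the exact-posterior `toy_denoiser`, and the re-noise-then-denoise corrector are all hypothetical choices made here so the example stays self-contained. A real model would replace `toy_denoiser` with a trained network.

```python
import numpy as np

def alpha(t):
    # Hypothetical linear schedule: alpha(0) = 1 (clean), alpha(1) = 0 (uniform noise).
    return 1.0 - t

def sample_categorical(probs, rng):
    # Inverse-CDF sampling, one draw per row of probs (shape (L, K), rows sum to 1).
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random((probs.shape[0], 1))
    return np.minimum((u > cdf).sum(axis=-1), probs.shape[1] - 1)

def toy_denoiser(z_t, t, p_data):
    # Assumption: tokens are i.i.d. with known probabilities p_data, so the
    # posterior p(x | z_t) is available in closed form via Bayes' rule.
    K = p_data.shape[0]
    a = alpha(t)
    lik = a * np.eye(K)[z_t] + (1.0 - a) / K        # q(z_t | x) for every x
    post = lik * p_data
    return post / post.sum(axis=-1, keepdims=True)

def predictor_step(z_t, t, s, p_data, rng):
    # Ancestral step t -> s (s < t): sample x_hat, then z_s ~ q(z_s | z_t, x_hat).
    K = p_data.shape[0]
    x_hat = sample_categorical(toy_denoiser(z_t, t, p_data), rng)
    a_s, a_ts = alpha(s), alpha(t) / alpha(s)
    prior = a_s * np.eye(K)[x_hat] + (1.0 - a_s) / K    # q(z_s | x_hat)
    lik = a_ts * np.eye(K)[z_t] + (1.0 - a_ts) / K      # q(z_t | z_s) for every z_s
    post = prior * lik
    return sample_categorical(post / post.sum(axis=-1, keepdims=True), rng)

def corrector_step(z_s, s, delta, p_data, rng):
    # Generic corrector (an assumption, not the paper's scheme): add a little
    # forward noise s -> u, then denoise back u -> s. The noise level is unchanged.
    K = p_data.shape[0]
    u = min(s + delta, 1.0)
    a_us = alpha(u) / alpha(s)
    noised = a_us * np.eye(K)[z_s] + (1.0 - a_us) / K   # q(z_u | z_s)
    z_u = sample_categorical(noised, rng)
    return predictor_step(z_u, u, s, p_data, rng)

def pc_sample(length, p_data, n_steps, n_corr, rng):
    # Start from uniform noise and alternate predictor and corrector steps.
    K = p_data.shape[0]
    z = rng.integers(0, K, size=length)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t, s in zip(ts[:-1], ts[1:]):
        z = predictor_step(z, t, s, p_data, rng)
        if s > 0:  # no correction needed once the sample is clean
            for _ in range(n_corr):
                z = corrector_step(z, s, 1.0 / n_steps, p_data, rng)
    return z
```

Calling `pc_sample(16, np.array([0.7, 0.2, 0.1]), n_steps=8, n_corr=1, rng=np.random.default_rng(0))` returns an array of 16 token ids. Because the toy denoiser is exact, extra corrector steps leave the sampling distribution unchanged in this sketch; with a learned, imperfect denoiser they would, in principle, give the chain additional chances to revise earlier token choices.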
Peer Reviews
Decision: ICLR 2026 Poster
- Clear motivation for why sampling needs improvement in discrete diffusion: standard samplers can "over-commit" and cannot "self-correct" without further hacks, sometimes including adversarial training or other approximations.
- The psi-sampler formulation is general and recovers a few previous predictor–corrector methods as special cases (though please be careful not to say "all cases in the literature", you never know with so many papers coming out daily, I suggest "that the authors are aware o
- The curriculum section assumes prior familiarity with Sahoo 2025a and gives little intuition for why that weighted-average operation helps training. Since you are devoting nearly a whole section to this method, I ask that you at least give a few sentences on how exactly the "curriculum" technique during training uses this average computation that you are approximating. Otherwise, the speedup could be relegated to an appendix section (I think "curriculum" is used 23 times in main text without b
1. I like Figure 1, which conveys the main results of the paper convincingly.
2. The paper is well written in the sense that the background section is well formulated and the main contributions of the paper are well supported by empirical results.
3. The idea of formulating non-Markovian forward processes for discrete diffusion models is quite interesting given its numerous applications in the context of continuous diffusion models in the form of DDIM.
I don't have a lot of concerns about the proposed method, but rather a few suggestions for improving the presentation of the paper.
**Presentation Issues**
1. Is there a reason for using the psi notation to denote distributions throughout the paper? We can probably drop it and denote distributions using standard notation like p(.) or q(.), as other works in the literature do.
2. In general, a lot of intuition is missing around the sampler design in Section 3. It is not clear,
The paper derives rigorous predictor-corrector schemes for both masked and uniform state diffusion models, as well as tractable approximations for the uniform state diffusion models.
See questions.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
