Diffusion Language Models are Provably Optimal Parallel Samplers
Haozhe Jiang, Nika Haghtalab, Lijie Chen

TL;DR
This paper establishes that diffusion language models, when augmented with chain-of-thought and revision mechanisms, are provably optimal parallel samplers capable of simulating any parallel sampling algorithm efficiently.
Contribution
It provides a theoretical foundation showing DLMs with CoT and revision can simulate any parallel sampling algorithm with optimal efficiency and space complexity.
Findings
DLMs with CoT can simulate any parallel sampling algorithm using an optimal number of sequential steps.
Enabling remasking or revision improves DLMs' expressivity and efficiency.
Revision introduces a strict expressivity gap, making DLMs more powerful.
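To make the step-count and footprint claims in the findings concrete, here is a minimal toy sketch in Python. It is not the paper's construction: it assumes a layered circuit (a balanced XOR tree computing parity) as a stand-in for a parallel algorithm, treats each layer as one parallel decoding round, and compares how many token positions are needed when intermediate values are written to fresh chain-of-thought positions versus overwritten in place. All function names are hypothetical.

```python
from typing import List, Tuple


def parity_layers(bits: List[int]) -> List[List[int]]:
    """Return the values of every layer of a balanced XOR tree over `bits`.

    Layer 0 is the input; each later layer depends only on the previous one,
    so all of its gates could be evaluated in a single parallel round.
    """
    layers = [list(bits)]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        nxt = [
            prev[i] ^ prev[i + 1] if i + 1 < len(prev) else prev[i]
            for i in range(0, len(prev), 2)
        ]
        layers.append(nxt)
    return layers


def footprint_append_only(layers: List[List[int]]) -> Tuple[int, int]:
    """(rounds, positions) when each layer is written to fresh positions.

    Toy model of CoT without revision: nothing already revealed is changed,
    so the sequence grows with the total number of values (circuit size).
    """
    rounds = len(layers) - 1
    positions = sum(len(layer) for layer in layers)
    return rounds, positions


def footprint_in_place(layers: List[List[int]]) -> Tuple[int, int]:
    """(rounds, positions) when each layer overwrites the previous one.

    Toy model of revision: a synchronous update reads the old layer and
    writes the new one in the same slots, so only the widest layer
    (circuit width) has to fit.
    """
    rounds = len(layers) - 1
    positions = max(len(layer) for layer in layers)
    return rounds, positions


if __name__ == "__main__":
    example = parity_layers([1, 0, 1, 1, 0, 1, 0, 0])
    print("parity:", example[-1][0])
    print("append-only (rounds, positions):", footprint_append_only(example))
    print("in-place    (rounds, positions):", footprint_in_place(example))
```

In this toy, both variants use the same number of rounds (the tree depth); only the number of positions differs, which mirrors the size-versus-width distinction the findings and reviews describe.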
Abstract
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT…
Peer Reviews
Decision·ICLR 2026 Poster
Originality: The paper is original as it is (to the best of my knowledge) the first paper that analyzes the parallel sampling capabilities of DLMs through circuit complexity ideas, so it is an original combination of ideas from theoretical CS and generative modeling.
Quality: The main results are rigorous and the proofs are constructive. I am not an expert on this type of result but could follow (most of) them. Theorem 3.1 provides an equivalence between circuit depth and decoding rounds. The
My main concerns are how these theoretical results relate to practical DLM training and inference; i.e. the presented results are interesting but I am not convinced they are telling us much in particular about existing DLMs and whether there is a path to exploit these results to obtain better DLMs. Expressivity and Learnability. The main results (Thms 3.1, 3.2) are existence proofs. While they show that an optimal predictor $p$ and scheduler $\mathcal{F}$ exist for specific circuit classes, th
For me the importance of the results is twofold:
* First, they show the impact of remasking (or using a non-masking forward process) by showing not only that Diffusion Language Models can simulate any circuit, but also that the sequence length necessary to generate such circuits is limited by the width of the circuit in the case of Diffusion Language Models.
* Second, they highlight an example showing the superiority of Diffusion Language Models with remasking (or using a non-masking
I do not have a lot of complaints about the paper. I will highlight that I am not an expert on circuits, so I did find some parts of the paper hard to follow.
* I would suggest clarifying the writing, especially Section 4. In this section, while I understood the results and their consequences, I was unable to follow the logical structure. (Again, it might be acceptable for experts in circuits, but I could not follow.)
* Importance of the results: I am not fully convinced that being able to
- The paper is well-written and enough context is given for the reader on the notions used in the paper
- The subject is interesting since diffusion LMs are more and more used so a better theoretical understanding is of importance
- The circuit formulation is elegant and the proofs seem sound although I did not check all of the details in appendix
- The theoretical claims are of great interest and the methodology of comparing DLMs with and without remasking and revision is well conducted
- The p
I list below what I believe are weaknesses, but I would be happy to be corrected if I misunderstood some parts.
- The connection to distribution sampling should be made more explicit
- The succession of theoretical results without much discussion of their insights for diffusion LMs hinders the contributions
- Thm 3.1 is one of the main results but only an existence result. As such, it does not ensure that any DLM can simulate any distribution sampling efficiently nor does it provide guarantees on
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
