MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger

TL;DR
This paper introduces MDPO, a reinforcement learning-based method that aligns training and inference in masked diffusion language models, significantly improving performance with fewer updates and enabling flexible token refinement.
Contribution
We propose MDPO, a novel training approach that explicitly matches the inference process of MDLMs, reducing training complexity and enhancing model performance.
Findings
MDPO matches the performance of the previous state-of-the-art method with 60x fewer gradient updates.
MDPO improves performance on MATH500 and Countdown benchmarks.
Running Confidence Remasking (RCR) enhances inference flexibility and overall performance.
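To make the idea of remasking already-committed tokens concrete, here is a minimal sketch of one denoising step. It is not the paper's implementation: the function name `rcr_step`, the `n_keep` budget, and the rule "keep the positions with the highest stored or predicted confidence" are assumptions made for illustration, and the stored confidence here is simply the value recorded when a token was committed.

```python
MASK = "[MASK]"

def rcr_step(seq, running_conf, preds, n_keep):
    """One illustrative denoising step with confidence-based remasking.

    seq          -- current tokens (MASK for masked positions)
    running_conf -- confidence stored per committed position (None if masked)
    preds        -- hypothetical model output: (token, confidence) per position
    n_keep       -- how many positions should be unmasked after this step
    """
    cand_tok, cand_conf = [], []
    for i, tok in enumerate(seq):
        if tok == MASK:
            # Masked position: the candidate is the model's current prediction.
            cand_tok.append(preds[i][0])
            cand_conf.append(preds[i][1])
        else:
            # Committed position: it competes with its stored confidence.
            cand_tok.append(tok)
            cand_conf.append(running_conf[i])
    # Keep the n_keep most confident positions; everything else is (re)masked.
    ranked = sorted(range(len(seq)), key=lambda i: cand_conf[i], reverse=True)
    keep = set(ranked[:n_keep])
    new_seq = [cand_tok[i] if i in keep else MASK for i in range(len(seq))]
    new_conf = [cand_conf[i] if i in keep else None for i in range(len(seq))]
    return new_seq, new_conf

# Toy example: position 0 was committed early with low confidence (0.35)
# and loses its slot to two more confident new predictions.
seq = ["a", MASK, MASK, MASK]
conf = [0.35, None, None, None]
preds = [("a", 0.5), ("b", 0.9), ("c", 0.8), ("d", 0.2)]
new_seq, new_conf = rcr_step(seq, conf, preds, n_keep=2)
```

In this toy run the early token at position 0 is remasked, which is the kind of flexibility that a scheme freezing tokens once unmasked would not allow.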
Abstract
Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, masked diffusion language models (MDLMs) suffer from a key discrepancy between training and inference: during inference, they progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas training ignores this structure because tokens are masked at random. Although this discrepancy can lead to suboptimal performance, it has been largely overlooked by previous work, and closing the gap between the two stages remains an open problem. To address this, we frame the learning of effective denoising trajectories as a sequential decision-making problem and apply reinforcement learning within the resulting framework. We propose a novel Masked Diffusion Policy…
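The training-inference discrepancy described in the abstract can be illustrated with a toy sketch. The code below is an assumption-laden illustration, not the authors' code: `random_training_mask`, `confidence_based_inference`, and the fixed-confidence `toy_predict` are hypothetical names, and the even per-step unmasking schedule is chosen only for simplicity.

```python
import random

MASK = "[MASK]"
TARGET = ["2", "+", "3", "=", "5", "."]

def random_training_mask(tokens, mask_ratio, rng):
    """Training-style corruption: mask a uniformly random subset of positions."""
    n_mask = round(mask_ratio * len(tokens))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked else t for i, t in enumerate(tokens)]

def toy_predict(seq):
    """Hypothetical stand-in for a model: (token, confidence) per position."""
    conf = [0.9, 0.4, 0.8, 0.6, 0.3, 0.7]
    return [(TARGET[i], conf[i]) for i in range(len(seq))]

def confidence_based_inference(length, steps, predict):
    """Inference-style denoising: start fully masked and, at each step,
    commit the most confident predictions so the mask count shrinks."""
    seq = [MASK] * length
    trajectory = [list(seq)]
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over remaining steps.
        k = -(-len(masked) // (steps - step))  # ceiling division
        preds = predict(seq)
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:k]:
            seq[i] = preds[i][0]
        trajectory.append(list(seq))
    return trajectory

# Training sees an arbitrary random mask pattern ...
train_view = random_training_mask(TARGET, 0.5, random.Random(0))
# ... while inference follows a structured path from all-masked to unmasked.
trajectory = confidence_based_inference(len(TARGET), 3, toy_predict)
mask_counts = [s.count(MASK) for s in trajectory]  # [6, 4, 2, 0]
```

The intermediate states in `trajectory` are determined by the model's own confidences, whereas `train_view` is drawn independently of them; that mismatch is the gap the abstract refers to.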
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The reward function definition that encourages improvement in each denoising step is interesting (and, I believe, novel). However, a thorough ablation is missing to demonstrate the usefulness of including such a term: how does it compare to just optimizing the total reward? I will consider increasing the score if this is included in the rebuttal phase.
* The observation made about the "remasking" strategy in LLaDA is interesting - such remasking only happens in the same step where the same token
* The authors seem to use the terms "masked diffusion" and "diffusion language models" interchangeably, although these are related but not exactly the same concepts. Masked diffusion (notably used for text but not only text, e.g., D3PM and MD4 apply to images too) did not use confidence-based inference or remasking. Confidence-based inference and remasking were first introduced in the language modeling context by LLaDA. Therefore, I find the name of the method (MDPO) as well as many statements such as "MD
The paper is well written and presented, the general setup seems reasonable, and I believe the authors' claims are generally supported by their experiments. Denoising and how it relates to downstream tasks with respect to performance and efficiency is a useful area of research for diffusion models, and this work is a timely addition.
The main concern I have with this work is that only two tasks are evaluated. While the current evaluation seems reasonable to me, having only two tasks and being limited to verifiable tasks really limits the scope of this work. The authors do partially address this in the work, but I believe there needs to be some sort of analysis on tasks that are not easily verifiable to quantify possible error modes/limitations of MDPO as a function of the "difficulty to verify" the task,
1. The identification of the over-denoising phenomenon is interesting. The paper proposes a policy optimization approach to mitigate the problem and a corresponding decoding method that can further enhance performance.
2. The proposed method is easy to follow and the flow of the paper is clean.
1. The experiments are performed only on LLaDA-Instruct and on two tasks, which is not comprehensive enough.
2. The proposed policy optimization algorithm lacks theoretical insights. It is unclear how this objective can optimize the policy towards a more favorable one and what the relationship is between the reward model and the optimized policy.
3. Similarly, running confidence remasking (RCR) is also proposed in a rather heuristic way. It is unclear how RCR is related to enhance performance and
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
