Soft-Masked Diffusion Language Models

Michael Hersche; Samuel Moor-Smith; Thomas Hofmann; Abbas Rahimi

arXiv:2510.17206·cs.LG·March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces soft-masking, a method for diffusion language models that blends the mask token's embedding with the embeddings of the top-k predicted tokens, so that predictive information is carried across decoding steps instead of being discarded as in binary masking.

Contribution

The paper proposes soft-masking for diffusion language models, enhancing context preservation and predictive accuracy, with effective training strategies for both from-scratch and pretrained models.

Findings

01 Soft-masking improves perplexity and MAUVE scores.

02 Enhanced performance on coding benchmarks.

03 Effective for both training from scratch and fine-tuning.

Abstract

Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-k predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior,…
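The blending step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `soft_mask_embedding`, the fixed blend weight `lam`, and the renormalization of the top-k probabilities are all assumptions made for illustration, and the abstract says only that the blend is "dynamic" without specifying how the weight is chosen.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mask_embedding(probs, embedding_table, mask_id, k=3, lam=0.5):
    """Blend the mask embedding with the top-k predicted token embeddings.

    probs           -- predicted distribution for one retained mask position
    embedding_table -- list of per-token embedding vectors
    mask_id         -- index of the mask token in the table
    k               -- number of top predictions to mix in
    lam             -- hypothetical fixed blend weight in [0, 1] (assumption)
    """
    # Indices of the k most probable tokens from the previous decoding step.
    top_ids = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the top-k probabilities so the mixture weights sum to 1.
    norm = sum(probs[i] for i in top_ids)
    dim = len(embedding_table[mask_id])
    predicted = [
        sum(probs[i] / norm * embedding_table[i][d] for i in top_ids)
        for d in range(dim)
    ]
    # Convex combination of the mask embedding and the predicted mixture.
    return [
        (1.0 - lam) * embedding_table[mask_id][d] + lam * predicted[d]
        for d in range(dim)
    ]


# Toy setup: 8-token vocabulary, 4-dimensional embeddings, last token is the mask.
random.seed(0)
vocab, dim, mask_id = 8, 4, 7
E = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab)]
logits = [random.gauss(0.0, 1.0) for _ in range(vocab)]
logits[mask_id] = float("-inf")  # the mask token itself is never predicted
probs = softmax(logits)
soft = soft_mask_embedding(probs, E, mask_id, k=3, lam=0.5)
```

With `lam=0.0` the sketch reduces to ordinary binary masking (the plain mask embedding is fed back), which is the baseline the abstract contrasts against; larger `lam` passes more of the previous step's prediction to the next step.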

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The motivation is strong. The work addresses a key weakness of masked diffusion LLMs, where the state is a binary decision and cannot represent superpositions of different tokens.
* The proposed fix is sensible and does not change the formulation of the probabilistic model, i.e., the model can still be trained with the standard ELBO objective, with an additional modification to the input of the neural network.
* The empirical improvements over pure mask models in continued pretraining are very stro

Weaknesses

* The proposed method requires two evaluations of the denoiser network in each training iteration. This makes the comparison to pure masked models unfair (the latter need only one network evaluation) given the same batch size. I would expect to see a comparison that matches the training FLOPs.
* There is a very closely related prior method, "self-conditioning", that the authors failed to prominently highlight. Although it is briefly mentioned in the related work, given the similarity of the two a

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The presentation is clear and easy to follow.
- Although the information blank issue has been observed by concurrent works, I find the idea of using the mixture of top-k token indexes as the mask token embedding novel.
- The authors validate the effectiveness of the proposed SM method through text and code generation experiments, which support their claim.

Weaknesses

- The training methodology in Section 3.2 lacks a detailed derivation. Since the authors introduce an embedding for masked tokens, both the forward process and the backward process have been changed. The authors then propose to use the two-pass method to train the new SM model. However, this training method is rather heuristic, and the authors do not provide any theoretical analysis to explain what exactly the training process does (e.g., is it maximizing the likelihood?).
- The formulation in Line

Reviewer 03 · Rating 8 · Confidence 3

Strengths

I'm very positive on this paper! The idea is very simple, but the authors conducted a fairly comprehensive set of evaluations under various reasonable configurations and showed tangible performance gains. Overall I think this is a good paper and thus lean towards acceptance, although there are still a few unanswered questions I would like to see answered.

Weaknesses

Overall I think the paper is well executed. There are a few things that I'd also like to see to get a better idea of the limitations and benefits of this method; I'll defer those to the questions section below.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods · Topic Modeling