Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models
Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos

TL;DR
This paper investigates the context comprehension abilities of Masked Diffusion Language Models (MDLMs), revealing their limitations in handling distant information and the distracting effect of mask tokens, and proposes a new training method to improve robustness.
Contribution
The paper identifies key limitations of MDLMs in context understanding and introduces a mask-agnostic loss function to enhance their robustness against masking distractions.
Findings
MDLMs exhibit a strong locality bias despite bidirectional training.
Appending many mask tokens degrades context comprehension.
The proposed loss function improves model robustness to masking effects.
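The idea of a loss that is robust to masking can be illustrated with a small sketch. This summary does not spell out the paper's mask-agnostic loss, so what follows is only one plausible form, assumed for illustration: the usual cross-entropy on the target token, plus a penalty when the model's prediction changes between a run with few appended masks and a run with many. The function names, the `weight` parameter, and the choice of total variation distance are all hypothetical and are not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the target index."""
    return -math.log(probs[target])


def total_variation(p: List[float], q: List[float]) -> float:
    """Total variation distance between two distributions (0 when identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mask_invariant_loss(
    logits_few: List[float],
    logits_many: List[float],
    target: int,
    weight: float = 1.0,  # hypothetical trade-off coefficient
) -> float:
    """Illustrative loss (an assumption, not the paper's formulation).

    logits_few / logits_many are the model's logits for the same answer
    position when few or many mask tokens are appended to the same context.
    """
    p_few = softmax(logits_few)
    p_many = softmax(logits_many)
    ce = 0.5 * (cross_entropy(p_few, target) + cross_entropy(p_many, target))
    return ce + weight * total_variation(p_few, p_many)


# Same logits under both mask counts: the invariance penalty vanishes.
stable = mask_invariant_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], target=0)
# Prediction drifts when many masks are appended: the penalty is positive.
drifting = mask_invariant_loss([2.0, 0.5, 0.1], [0.7, 0.6, 0.5], target=0)
print(stable, drifting)
```

Under this assumed form, a model whose predictions do not depend on the number of appended masks pays only the ordinary cross-entropy term.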
Abstract
Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, MDLMs, like ARLMs, exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens, which is required for generation, can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address…
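To make the second limitation concrete, the sketch below shows how a generation input for a masked diffusion model might be assembled: the context followed by a block of mask tokens whose length sets the generation budget. It is a minimal illustration under stated assumptions. The `[MASK]` string, the whitespace tokenisation, and both helper functions are hypothetical stand-ins, not the tokenizer or API of LLaDA or Dream.

```python
from typing import List

MASK = "[MASK]"  # hypothetical mask token; real models use a model-specific id


def build_generation_input(context: List[str], num_masks: int) -> List[str]:
    """Return the context followed by num_masks mask tokens."""
    if num_masks < 1:
        raise ValueError("need at least one mask position to predict")
    return list(context) + [MASK] * num_masks


def mask_fraction(tokens: List[str]) -> float:
    """Share of the sequence occupied by mask tokens."""
    return sum(t == MASK for t in tokens) / len(tokens)


# Toy context, tokenised on whitespace purely for illustration.
context = "Q: What is the capital of France ? A:".split()

# The same context with increasing generation budgets.
for n in (1, 8, 64):
    seq = build_generation_input(context, n)
    print(f"masks={n:>2}  length={len(seq):>2}  mask share={mask_fraction(seq):.2f}")
```

As the budget grows, the fixed context becomes a shrinking share of the sequence the model attends over, which is the setting in which the abstract reports degraded comprehension.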
Peer Reviews
Decision·Submitted to ICLR 2026
1. The work makes a solid contribution by uncovering understudied limitations of MDLMs: locality bias and mask distraction, which were not systematically explored in prior masked dLLM research. This fills a critical gap, as MDLMs are often assumed to leverage global context more uniformly than ARLMs. The mask-agnostic loss offers a practical, architecture-agnostic fix for MDLM robustness, and the evaluation guidelines address reproducibility issues in existing MDLM benchmarks (where mask conf
1. The analysis is limited to two open-source MDLMs (LLaDA-8B, Dream-7B) with opaque pre-training details (e.g., exact datasets, masking schedules). This makes it hard to disentangle model-specific quirks (e.g., Dream’s ARLM initialization) from general MDLM properties. Testing additional controlled pre-trained variants would strengthen generalizability.
2. The few-shot tasks used to measure locality bias are relatively simple (e.g., choosing adjectives); testing on more complex reasoning tasks
* This paper addresses an important topic: the context comprehension capability of emerging diffusion LLMs.
* The authors took a scientific approach by designing thoughtful experiments and iteratively refining the hypothesis based on the results. For the study of locality bias, the authors first establish a hypothesis of recency bias, then disentangle it from the left-to-right order through new experiments that move the mask position. A similar approach has been taken towards the study of
* The fact that diffusion LLMs exhibit a locality bias has been similarly discovered in prior work (DiffuCoder), diminishing the importance of the discovery.
* Although the paper takes a scientific approach in experiment design and hypothesis checking, it offers limited insight into the mechanisms. The experiments are mostly designed to test whether model performance is impacted by a factor rather than exposing why it is impacted.
* The study relies on pretrained OSS diffusion LLM
This paper presents some evidence that aligns with existing intuitions. For instance, the unmasked tokens can reduce the uncertainty of nearby masks, and the more masks we have the less accurate the predictions would be due to higher uncertainty. It’s nice to see some concrete analysis to support these intuitions.
1. The paper's conclusions align with intuition: surrounding unmasked tokens likely provide more contextual information, aiding in the unmasking process. This phenomenon could be further explained by examining the role of attention in Transformer models. However, the analysis of locality bias would be more convincing if the variable lengths of sequences were normalized or fixed. This would help control for the difference between context length and the actual sequence lengths of samples, which is
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Generative Adversarial Networks and Image Synthesis
