Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes
Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun

TL;DR
This paper introduces Deletion-Insertion Diffusion models (DID) for language modeling, which improve efficiency and flexibility over masked diffusion models by using token deletion and insertion as diffusion processes, enabling variable-length sequences and self-correction.
Contribution
The paper proposes a diffusion framework that replaces masking and unmasking with token deletion and insertion. It pairs this with a score-based training objective and a dynamic-programming procedure that keeps training efficient, and it reports improvements over masked diffusion baselines.
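This page does not spell out the dynamic program. One of the reviews notes that the objective is derived via subsequence counts, so as a rough illustration only, the sketch below uses the textbook dynamic program for counting how many ways a shorter token sequence can be embedded as a subsequence of a longer one. The function name and the use of this particular recurrence are assumptions for illustration, not the authors' algorithm.

```python
def count_subsequence_embeddings(short, long):
    """Count the ways `short` occurs as a subsequence of `long`.

    Illustrative textbook DP (hypothetical helper, not the paper's code).
    Runs in O(len(short) * len(long)) time and O(len(short)) memory.
    """
    # ways[j] = number of ways to match the first j tokens of `short`
    # using the part of `long` scanned so far.
    ways = [0] * (len(short) + 1)
    ways[0] = 1  # the empty sequence embeds in exactly one way
    for token in long:
        # Go through j in decreasing order so that a single token of `long`
        # is never used for two positions of the same embedding.
        for j in range(len(short), 0, -1):
            if short[j - 1] == token:
                ways[j] += ways[j - 1]
    return ways[len(short)]
```

For example, `count_subsequence_embeddings("ab", "aab")` returns 2, because either `a` can be paired with the final `b`. Intuitively, a count like this says how many distinct sets of deletions turn the longer sequence into the shorter one.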
Findings
DID outperforms baselines in modeling performance and sampling quality.
DID achieves faster training and inference speeds.
DID supports variable-length sequences without padding.
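To make the padding point concrete, here is a small back-of-the-envelope sketch of how many positions in a batch are `<PAD>` when every sequence is padded to the longest one. The helper name and the example lengths are hypothetical, and the sketch says nothing about how DID itself batches sequences.

```python
def padding_overhead(lengths):
    """Return (real_tokens, padded_tokens, wasted_fraction) for one batch.

    Assumes the batch is padded to its longest sequence (illustrative only).
    """
    real = sum(lengths)                   # tokens that carry content
    padded = max(lengths) * len(lengths)  # positions actually computed on
    return real, padded, (padded - real) / padded
```

With lengths `[5, 20, 100]`, the batch holds 125 real tokens in 300 padded positions, so a bit over 58% of the positions are padding.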
Abstract
While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: the computations on non-informative 1) <MASK> tokens inherent to the paradigm, and 2) <PAD> tokens introduced in variable-length settings. Furthermore, DID offers greater flexibility by: 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) an intrinsic self-correction mechanism…
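The contrast between the two corruption styles can be sketched in a few lines. The example below assumes each token is corrupted independently with a fixed probability `p`, which is a simplification chosen for illustration; the paper's actual noise schedule and process definition are not given on this page, and both function names are hypothetical.

```python
import random

MASK = "<MASK>"

def mask_corrupt(tokens, p, rng):
    """Masking-style corruption: length is preserved, content is replaced."""
    return [MASK if rng.random() < p else t for t in tokens]

def delete_corrupt(tokens, p, rng):
    """Deletion-style corruption: corrupted tokens are dropped entirely."""
    return [t for t in tokens if rng.random() >= p]

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
masked = mask_corrupt(tokens, 0.5, rng)
deleted = delete_corrupt(tokens, 0.5, rng)
# The masked sequence always has len(tokens) positions; the deleted one
# is a (usually shorter) subsequence of the original.
```

Under this simplified view, a model reading `masked` still spends compute on every `<MASK>` position, whereas a model reading `deleted` only sees surviving tokens and must decide where to insert new ones.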
Peer Reviews
Decision: ICLR 2026 Poster
The overall idea of the proposed method is creative and explores a new direction for language diffusion models. I see the main strengths of the paper as follows: - method: the shift from masking to pure deletion/insertion for diffusion LMs is intuitive, and the mathematical formulation of the insertion score and the derivation of the objective via subsequence counts seem correct - efficiency: the paper makes a strong case for the inefficiency of MDLMs due to mask and pad tokens, and the reporte
I believe there are a few weaknesses: - experiments: the experiments are relatively small-scale (OpenWebText, Stories) compared to state-of-the-art LLM research; while this is fine for providing a proof of concept, it's unclear if the reported efficiency gains hold or if new instabilities arise at larger scales - complexity: this may be a nitpick, but the DP approach seems much more complicated and harder to implement than, say, standard cross-entropy or MDLM objectives
Clear, well-motivated reformulation that naturally supports variable length; practical training and sampling mechanics that fit modern accelerators; consistent efficiency gains with competitive quality; writing is clear and limitations are acknowledged.
Evidence is concentrated on relatively small models and moderate sequence lengths, so scalability to long contexts and larger models is uncertain; evaluation leans on automatic metrics with a single external scorer and no human assessment; baseline alignment (steps, precision, compute) and ablations could be tighter to isolate where gains come from; the added bookkeeping likely introduces overhead whose impact isn’t fully profiled.
- DID is a novel, theoretically grounded approach that tackles the important problem of variable-length generation of texts using discrete diffusion, bridging an important gap to the traditional AR methods. - The technical details are sound with a self-contained derivation. - The numerical experiments are comprehensive, and the results are promising.
- While the experiments are comprehensive, the results could be made more convincing by comparing with more discrete diffusion model baselines other than MDM, such as MDM-prime [1], Block Diffusion [2], etc. - The paper could use more space to discuss differences & advantages compared with other variable-length discrete diffusion approaches. The existing discussion is nice, but it's not entirely clear what the advantage of DID is compared to existing approaches, especially EditFlow [3], which a
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
