Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions
Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

TL;DR
This paper introduces Insertion Language Models (ILMs), a new sequence generation approach that inserts tokens at arbitrary positions, enabling better modeling of complex dependencies and flexible sequence infilling compared to traditional autoregressive and masked diffusion models.
Contribution
ILMs extend insertion-based generation to general language modeling: they learn to insert tokens at arbitrary positions, jointly selecting position and token, which improves sequence modeling and infilling flexibility.
Findings
ILMs outperform ARMs and MDMs on planning tasks.
ILMs match ARMs in unconditional text generation.
ILMs excel in arbitrary-length text infilling.
Abstract
Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence; that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one…
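The insertion step described in the abstract can be illustrated with a toy decoding loop. This is a minimal sketch and not the authors' implementation: `toy_score_fn` is a hypothetical stand-in for a learned network, the `STOP` action and the fixed `TARGET` sentence are assumptions made for illustration, and all names are invented here. It shows only the core idea that each step draws one sample from a single distribution over (position, token) pairs, so the number of inserted tokens is not fixed in advance.

```python
import random
from typing import Callable, Dict, List, Tuple

# An action is (slot, token): insert `token` so that it lands at index `slot`.
Action = Tuple[int, str]
STOP: Action = (-1, "<stop>")  # hypothetical stop action (an assumption of this sketch)
TARGET = ["the", "quick", "brown", "fox", "jumps"]  # toy sentence with distinct words


def toy_score_fn(seq: List[str]) -> Dict[Action, float]:
    """Hypothetical stand-in for a learned model.

    Gives equal weight to every (slot, token) pair that moves `seq`, assumed to
    be a subsequence of TARGET, one step closer to TARGET.
    """
    rank = {tok: i for i, tok in enumerate(TARGET)}
    present = set(seq)
    scores: Dict[Action, float] = {}
    for tok in TARGET:
        if tok not in present:
            # The slot is the number of current tokens that precede `tok` in TARGET.
            slot = sum(1 for t in seq if rank[t] < rank[tok])
            scores[(slot, tok)] = 1.0
    # Nothing left to insert: the only available action is to stop.
    return scores or {STOP: 1.0}


def generate(
    score_fn: Callable[[List[str]], Dict[Action, float]],
    seq: List[str],
    max_steps: int = 32,
    seed: int = 0,
) -> List[str]:
    """Insert one token per step until the stop action is sampled."""
    rng = random.Random(seed)
    seq = list(seq)  # work on a copy so the caller's list is untouched
    for _ in range(max_steps):
        scores = score_fn(seq)
        actions = list(scores)
        weights = [scores[a] for a in actions]
        # One draw selects position and token jointly.
        action = rng.choices(actions, weights=weights, k=1)[0]
        if action == STOP:
            break
        slot, token = action
        seq.insert(slot, token)
    return seq


if __name__ == "__main__":
    # Infilling: the loop decides how many tokens go between and around the given words.
    print(generate(toy_score_fn, ["quick", "fox"]))
```

Starting from `["quick", "fox"]`, the loop fills in the three missing words in a random order and then stops. A fixed-length mask would instead require the number of missing tokens to be specified up front.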
Peer Reviews
Decision: Submitted to ICLR 2026
The paper makes an original contribution by adapting insertion-based sequence generation to general language modeling, allowing models to generate sequences of arbitrary length. This is a practical extension of existing techniques, expanding their applicability beyond fixed-length generation tasks. The presentation is clear, and the experiments on both synthetic and real-world datasets are well structured. Results show consistent improvements over strong baselines such as Autoregressive and Mask
The paper’s technical novelty is somewhat limited, as its main contribution lies in applying an existing insertion-based sequence generation technique to the problem of variable-length text generation. While this adaptation is practical, the paper does not sufficiently deepen the theoretical or conceptual understanding of insertion-based generation, nor does it clearly articulate the unique challenges encountered when extending this approach to general language modeling. A more thorough analysis
The paper successfully modernizes the classical idea of insertion-based language models by combining it with denoising objectives. The performance improvements are convincingly demonstrated on synthetic tasks, clearly showing the advantages of the proposed approach. As a reference, it may be helpful to also cite:
- Insertion-based Decoding with Automatically Inferred Generation Order, https://arxiv.org/abs/1902.01370
- **Method: Fundamental Inefficiency** The primary limitation of ILMs is that they sacrifice parallelization benefits from both ARMs and MDMs. Unlike ARMs, ILMs cannot leverage efficient parallel training, and unlike MDMs, they do not support parallel inference. However, despite giving up these parallelization advantages, the improvement in generative perplexity is not substantial (Table 2, LM1B).
- **Experiment Setup: NLL Measurement** NLL measures how accurately a model captures the learned
1. The authors clearly articulate the shortcomings of existing ARMs (sequential bias) and MDMs (fixed-length masks, simultaneous unmasking). ILMs offer a different approach with many benefits, and the authors' presentation of this is clear.
2. The experiments are well chosen and convincing. The paper evaluates ILMs on synthetic planning tasks (star graphs, zebra puzzles) and on realistic text datasets (LM1B, TinyStories). ILMs outperform ARMs and MDMs on constrained reasoning tasks and perform
1. Not allowing caching is a pretty big deal for incremental generative models. I understand that it is future work, but it could be nice to discuss a bit what the possible avenues are toward efficient ILM inference?
2. I always feel bad saying this, but it would be nice to see how results change at larger scales. I know this is easier said than done, and scaling up experiments is usually the goal anyway. However, my point is that sometimes the extra flexibility/expressivity becomes harder for
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Games · Multimodal Machine Learning Applications
Methods: Diffusion
