Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions

Dhruvesh Patel; Aishwarya Sahoo; Avinash Amballa; Tahira Naseem; Tim G. J. Rudner; Andrew McCallum

arXiv:2505.05755 · cs.CL · September 4, 2025

TL;DR

This paper introduces Insertion Language Models (ILMs), a new sequence generation approach that inserts tokens at arbitrary positions, enabling better modeling of complex dependencies and flexible sequence infilling compared to traditional autoregressive and masked diffusion models.

Contribution

ILMs are the first models to learn to insert tokens at arbitrary positions, jointly selecting position and token, improving sequence modeling and infilling flexibility.
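
To make "jointly selecting position and token" concrete, here is a minimal sketch of an insertion decoding loop. It is not the authors' implementation: `toy_logits` is a hand-written stand-in for a trained network (it simply knows a target sentence), the explicit `STOP` action is an assumption about how generation ends, and every function name is hypothetical. The point it illustrates is that one softmax ranges over all (slot, token) pairs at once, so the generation order is not fixed in advance.

```python
import math
import random

STOP = None  # hypothetical "stop" action, competing with every insertion


def is_subsequence(seq, target):
    """True if seq can be obtained from target by deleting tokens."""
    it = iter(target)
    return all(tok in it for tok in seq)


def toy_logits(seq, target, vocab):
    """Stand-in for a trained network (assumption, not the paper's model).

    Returns one logit per action. An action is either STOP or a
    (slot, token) pair: slot i means "insert before seq[i]" and
    slot len(seq) means "append at the end".
    """
    logits = {STOP: 0.0 if seq == target else -math.inf}
    for slot in range(len(seq) + 1):
        for tok in vocab:
            candidate = seq[:slot] + [tok] + seq[slot:]
            ok = is_subsequence(candidate, target)
            logits[(slot, tok)] = 0.0 if ok else -math.inf
    return logits


def sample_action(logits, rng):
    """Sample from a single softmax over all actions (joint choice)."""
    actions = list(logits)
    weights = [math.exp(logits[a]) for a in actions]  # exp(-inf) == 0.0
    return rng.choices(actions, weights=weights, k=1)[0]


def generate(target, vocab, start=(), seed=0, max_steps=50):
    """Insert one token per step until the STOP action is sampled."""
    rng = random.Random(seed)
    target = list(target)
    seq = list(start)
    for _ in range(max_steps):
        action = sample_action(toy_logits(seq, target, vocab), rng)
        if action is STOP:
            break
        slot, tok = action
        seq.insert(slot, tok)
    return seq
```

Calling `generate(["the", "cat", "sat", "down"], vocab, start=["the", "down"])` fills the gap between the two given tokens without being told how many tokens are missing, which is the infilling setting the summary describes. In this toy the scorer only permits insertions consistent with the target, so different seeds take different insertion orders but end at the same sentence.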

Findings

01. ILMs outperform autoregressive models (ARMs) and masked diffusion models (MDMs) on planning tasks.

02. ILMs match ARMs in unconditional text generation.

03. ILMs excel in arbitrary-length text infilling.

Abstract

Autoregressive models (ARMs), which predict subsequent tokens one-by-one "from left to right," have achieved significant success across a wide range of sequence generation tasks. However, they struggle to accurately represent sequences that require satisfying sophisticated constraints or whose sequential dependencies are better addressed by out-of-order generation. Masked Diffusion Models (MDMs) address some of these limitations, but the process of unmasking multiple tokens simultaneously in MDMs can introduce incoherences, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled in is not known in advance. In this work, we introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence; that is, they select jointly both the position and the vocabulary element to be inserted. By inserting tokens one…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The paper makes an original contribution by adapting insertion-based sequence generation to general language modeling, allowing models to generate sequences of arbitrary length. This is a practical extension of existing techniques, expanding their applicability beyond fixed-length generation tasks. The presentation is clear, and the experiments on both synthetic and real-world datasets are well structured. Results show consistent improvements over strong baselines such as Autoregressive and Mask

Weaknesses

The paper’s technical novelty is somewhat limited, as its main contribution lies in applying an existing insertion-based sequence generation technique to the problem of variable-length text generation. While this adaptation is practical, the paper does not sufficiently deepen the theoretical or conceptual understanding of insertion-based generation, nor does it clearly articulate the unique challenges encountered when extending this approach to general language modeling. A more thorough analysis

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper successfully modernizes the classical idea of insertion-based language models by combining it with denoising objectives. The performance improvements are convincingly demonstrated on synthetic tasks, clearly showing the advantages of the proposed approach. As a reference, it may be helpful to also cite:
- Insertion-based Decoding with Automatically Inferred Generation Order, https://arxiv.org/abs/1902.01370

Weaknesses

- **Method: Fundamental Inefficiency** The primary limitation of ILMs is that they sacrifice parallelization benefits from both ARMs and MDMs. Unlike ARMs, ILMs cannot leverage efficient parallel training, and unlike MDMs, they do not support parallel inference. However, despite giving up these parallelization advantages, the improvement in generative perplexity is not substantial (Table 2, LM1B).
- **Experiment Setup: NLL Measurement** NLL measures how accurately a model captures the learned

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The authors clearly articulate the shortcomings of existing ARMs (sequential bias) and MDMs (fixed-length masks, simultaneous unmasking). ILMs offer a different approach with many benefits, and the authors' presentation of this is clear.
2. The experiments are well chosen and convincing. The paper evaluates ILMs on synthetic planning tasks (star graphs, zebra puzzles) and on realistic text datasets (LM1B, TinyStories). ILMs outperform ARMs and MDMs on constrained reasoning tasks and perform

Weaknesses

1. Not allowing caching is a pretty big deal for incremental generative models. I understand that it is future work, but it would be nice to discuss possible avenues toward efficient ILM inference.
2. I always feel bad saying this, but it would be nice to see how results change at larger scales. I know this is easier said than done, and scaling up experiments is usually the goal anyway. However, my point is that sometimes the extra flexibility/expressivity becomes harder for

Code & Models

Models

🤗 dhruveshpatel/ilm-owt · model · 205 dl

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Games · Multimodal Machine Learning Applications

Methods: Diffusion