Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Yonggan Fu; Lexington Whalen; Zhifan Ye; Xin Dong; Shizhe Diao; Jingyu Liu; Chengyue Wu; Hao Zhang; Enze Xie; Song Han; Maksim Khadkevich; Jan Kautz; Yingyan Celine Lin; Pavlo Molchanov

arXiv:2512.14067·cs.CL·May 1, 2026


TL;DR

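This paper studies AR-to-dLM conversion: turning pretrained autoregressive (AR) language models into diffusion language models (dLMs) that generate faster while preserving the AR models' task accuracy. The conversion maintains the pretrained AR weight distributions and uses a position-dependent masking strategy.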

Contribution

The paper proposes a continuous pretraining scheme built on a block-wise attention pattern, combined with a position-dependent masking strategy, to make AR-to-dLM conversion more effective in both speed and accuracy.

Findings

01. Efficient-DLM 8B outperforms Dream 7B and Qwen3 4B in accuracy and throughput.

02. Maintaining pretrained AR weight distributions is crucial for effective AR-to-dLM conversion.

03. The proposed methods lead to state-of-the-art results in speed and accuracy for diffusion language models.

Abstract

Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing principles and methodologies for more effective AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is critical for effective AR-to-dLM conversion. As such, we introduce a continuous pretraining scheme with a block-wise attention pattern, which remains causal…
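To make the block-wise attention pattern mentioned in the abstract concrete, here is a minimal sketch of one common way to build such a mask. It is an illustration under stated assumptions, not the authors' implementation: the abstract is truncated on this page, so the sketch assumes attention is causal across blocks and unrestricted within a block. The function name `block_causal_mask` and the `block_size` parameter are hypothetical.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[bool]]:
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumption (not confirmed by the source): a token sees every token in its
    own block and in all earlier blocks, and nothing in later blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size  # index of the block containing the query
        # A key is visible if its block is the query's block or an earlier one.
        mask.append([(k // block_size) <= q_block for k in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Print the pattern for 6 tokens in blocks of 2 ("x" = may attend).
    for row in block_causal_mask(6, 2):
        print("".join("x" if ok else "." for ok in row))
```

With `block_size=1` this reduces to the standard causal (lower-triangular) mask of an AR model, which is one way to see why such a pattern stays close to AR attention while still allowing several tokens in a block to be predicted together.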
