Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, Hao Peng, Lingpeng Kong

TL;DR
This paper presents a method to adapt existing autoregressive language models into diffusion models, enabling efficient training and competitive performance on various language tasks.
Contribution
The authors introduce a simple continual pre-training approach to convert AR models into diffusion models, demonstrating competitive results with less training data.
Findings
Converted AR models outperform earlier diffusion models.
The adapted diffusion models (DiffuGPT and DiffuLLaMA) perform competitively with their AR counterparts on benchmarks.
Models can generate fluent text and follow instructions effectively.
Abstract
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and…
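The abstract does not spell out the training objective, so the following is only a minimal sketch of one common form of discrete diffusion for text: an absorbing-state (masked) forward process with a 1/t-weighted cross-entropy on the masked positions. It is not the authors' implementation. `MASK_ID`, `mask_tokens`, `masked_diffusion_loss`, and `uniform_log_probs` are hypothetical names introduced for illustration, and the uniform predictor stands in for a real denoiser such as an adapted AR model.

```python
import math
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token (assumption)


def mask_tokens(tokens, t, rng):
    """Absorbing-state forward process: each token is independently
    replaced by MASK_ID with probability t, where 0 < t <= 1."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(log_probs, tokens, noised, t):
    """Cross-entropy summed over masked positions only, weighted by 1/t
    and averaged over sequence length.

    log_probs[i][v] is the model's log-probability of vocabulary item v
    at position i, given the noised sequence.
    """
    total = 0.0
    for i, (orig, cur) in enumerate(zip(tokens, noised)):
        if cur == MASK_ID:  # unmasked positions contribute nothing
            total -= log_probs[i][orig]
    return total / (t * len(tokens))


def uniform_log_probs(seq_len, vocab_size):
    """Stand-in denoiser that predicts a uniform distribution everywhere."""
    return [[-math.log(vocab_size)] * vocab_size for _ in range(seq_len)]


# Toy usage: token ids must differ from MASK_ID.
rng = random.Random(0)
tokens = [5, 3, 7, 2, 9, 4]
t = 0.5  # noise level; in training this would be sampled per example
noised = mask_tokens(tokens, t, rng)
loss = masked_diffusion_loss(uniform_log_probs(len(tokens), 10), tokens, noised, t)
print(noised, loss)
```

In a continual pre-training setup along the lines the abstract describes, the uniform predictor would be replaced by the pre-trained AR network, and `t` would be drawn at random for each training sequence. How the paper bridges the AR and diffusion objectives is not covered by this sketch.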
Peer Reviews
Decision: ICLR 2025 Poster
The paper presents a simple and effective approach to adapting AR models to build DLMs, bridging the gap between the two modeling paradigms, while the proposed continual discrete diffusion pre-training approach for training diffusion models is practical and effective. The proposed models, DiffuGPT and DiffuLLaMA, show competitive performance with AR models and outperform earlier DLMs in generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and fol
Overall, I appreciate the achievement of well-performing (discrete) diffusion LMs at scale, but I do have some concerns regarding novelty and evaluation. **About building DLMs by scaling and adapting from pre-trained large-scale LMs:** Building (discrete) diffusion LMs from pre-trained Masked LMs has been studied in [1], while scaling and instruct-tuning (discrete) diffusion by adapting large-scale pre-trained LMs (mainly Masked-LMs while they also attempted to use AR-LMs/Llama1 to test reason
* Adapting AR models to diffusion models is a problem that is of interest to the ICLR community.
* The proposed method is intuitive and simple.
* The paper is well written and flows really well.
* The results demonstrate that the adaptation is successful to some extent: Adapting larger LLMs leads to better performance.
* The DD model shows superior performance on some tasks that mostly require infilling.
1. In lines 364-368, it states that "by observing different scales of our adapted diffusion models, we can draw the conclusion that scaling diffusion language models results in improved performance". This statement suggests that if a DD model was trained from scratch, performance would scale with the size of the model. However, this isn't supported by the experiments, which merely show that AR models of increasing size can be adapted to DD models in such a way that larger adapted models perfor
1. The paper proposes to adapt AR pretraining parameters to Diffusion parameters, which solves a very important problem about improving the training efficiency of diffusion models.
2. The result seems solid, with a fair amount of improvement over previous diffusion models. But I still think there are some unresolved problems as shown in the weakness section.
3. The experiment seems pretty comprehensive, with a wide range of downstream tasks, generation quality, analysis, ablation, and inference
Since the paper claimed to tackle the parameter transferability problem, there remain some unresolved scientific questions: (1) suppose we have more compute, is this transfer still optimal? would this transfer converge to a worse local optimum than if we train from scratch for longer? (2) The paper included ablation results on GSM8K performance, but more fundamentally, what's the loss in terms of PPL for the (without shift) and (without annealing) baselines? I think these are more important an
Code & Models
- 🤗 opendatalab/MinerU-Diffusion-V1-0320-2.5B · model · 889 dl · ♡ 19
- 🤗 diffusionfamily/diffullama · model · 302 dl · ♡ 13
- 🤗 diffusionfamily/diffugpt-s · model · 27 dl · ♡ 4
- 🤗 diffusionfamily/diffugpt-m · model · 136 dl
- 🤗 QuantFactory/diffullama-GGUF · model · 186 dl · ♡ 3
- 🤗 RichardErkhov/diffusionfamily_-_diffullama-gguf · model · 72 dl
- 🤗 diffusionfamily/diffullama-gsm · model · 6 dl · ♡ 3
- 🤗 RichardErkhov/diffusionfamily_-_diffullama-4bits · model · 1 dl
- 🤗 RichardErkhov/diffusionfamily_-_diffullama-8bits · model · 1 dl
- 🤗 temsa/IrishCore-DiffMask-135M-v1-rc1 · model · 466 dl
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Diffusion
