Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Shansan Gong; Shivam Agarwal; Yizhe Zhang; Jiacheng Ye; Lin Zheng; Mukai Li; Chenxin An; Peilin Zhao; Wei Bi; Jiawei Han; Hao Peng; Lingpeng Kong

arXiv:2410.17891·cs.CL·June 3, 2025

TL;DR

This paper presents a method to adapt existing autoregressive language models into diffusion models, enabling efficient training and competitive performance on various language tasks.

Contribution

The authors introduce a simple continual pre-training approach to convert AR models into diffusion models, demonstrating competitive results with less training data.

Findings

01

Converted AR models outperform earlier diffusion models.

02

The adapted diffusion models achieve performance comparable to their AR counterparts on benchmarks.

03

Models can generate fluent text and follow instructions effectively.

Abstract

Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and…
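
To make the abstract's "diffusion modeling objective" concrete, here is a minimal sketch of one common form of an absorbing-state (masked) discrete diffusion loss: each token is masked with probability t, and the cross-entropy on the masked positions is weighted by 1/t. The names `MASK`, `corrupt`, and `masked_diffusion_loss` are hypothetical, the model is replaced by hand-written per-position probabilities, and whether this matches the exact objective used in the paper is an assumption, not something this page states.

```python
import math
import random

MASK = "[MASK]"  # hypothetical absorbing token, not taken from the paper


def corrupt(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(tokens, noisy, probs, t):
    """Cross-entropy on masked positions only, weighted by 1/t.

    probs[i] maps each candidate token to the probability a (stand-in) model
    assigns to it at position i, given the noisy sequence.
    """
    nll = 0.0
    for clean_tok, noisy_tok, dist in zip(tokens, noisy, probs):
        if noisy_tok == MASK:  # unmasked positions contribute no loss
            nll += -math.log(dist[clean_tok])
    return nll / t


# Toy usage with hand-written probabilities standing in for model outputs.
tokens = ["the", "cat", "sat"]
probs = [{"the": 0.5}, {"cat": 0.25}, {"sat": 0.5}]
noisy = corrupt(tokens, 0.5, random.Random(0))
loss = masked_diffusion_loss(tokens, noisy, probs, 0.5)
```

At t = 1 every token is masked and the loss reduces to the summed per-token negative log-likelihood, which is one informal way to see why an AR-style token-prediction head is a plausible starting point for this kind of objective.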

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper presents a simple and effective approach to adapting AR models to build DLMs, bridging the gap between the two modeling paradigms, and the proposed continual discrete diffusion pre-training approach is practical and effective. The proposed models, DiffuGPT and DiffuLLaMA, show competitive performance with AR models and outperform earlier DLMs in generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and fol

Weaknesses

Overall, I appreciate the achievement of well-performing (discrete) diffusion LMs at scale, but I do have some concerns regarding novelty and evaluation. **About building DLMs by scaling and adapting from pre-trained large-scale LMs:** Building (discrete) diffusion LMs from pre-trained Masked LMs has been studied in [1], while scaling and instruct-tuning (discrete) diffusion by adapting large-scale pre-trained LMs (mainly Masked-LMs while they also attempted to use AR-LMs/Llama1 to test reason

Reviewer 02 · Rating 5 · Confidence 3

Strengths

* Adapting AR models to diffusion models is a problem that is of interest to the ICLR community.
* The proposed method is intuitive and simple.
* The paper is well written and flows really well.
* The results demonstrate that the adaptation is successful to some extent: Adapting larger LLMs leads to better performance.
* The DD model shows superior performance on some tasks that mostly require infilling.

Weaknesses

1. In lines 364-368, it states that "by observing different scales of our adapted diffusion models, we can draw the conclusion that scaling diffusion language models results in improved performance". This statement suggests that if a DD model were trained from scratch, performance would scale with the size of the model. However, this isn't supported by the experiments, which merely show that AR models of increasing size can be adapted to DD models in such a way that larger adapted models perfor

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper proposes to adapt AR pretraining parameters to Diffusion parameters, which solves a very important problem about improving the training efficiency of diffusion models.
2. The results seem solid, with a fair amount of improvement over previous diffusion models. But I still think there are some unresolved problems, as shown in the weaknesses section.
3. The experiments seem pretty comprehensive, with a wide range of downstream tasks, generation quality, analysis, ablation, and inference

Weaknesses

Since the paper claims to tackle the parameter transferability problem, some scientific questions remain unresolved: (1) Suppose we have more compute; is this transfer still optimal? Would this transfer converge to a worse local optimum than if we train from scratch for longer? (2) The paper included ablation results on GSM8K performance, but more fundamentally, what's the loss in terms of PPL for the (without shift) and (without annealing) baselines? I think these are more important an

Code & Models

Repositories

hkunlp/diffullama
PyTorch · Official

Models

Videos

Scaling Diffusion Language Models via Adaptation from Autoregressive Models· slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Diffusion