Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning
Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Quanquan Gu

TL;DR
This paper shows that scaling diffusion language models through data, size, and instruction finetuning enables them to perform a wide range of language tasks effectively, rivaling autoregressive models.
Contribution
It introduces a method to adapt pretrained masked language models into diffusion models and demonstrates their scalability and versatility in general language tasks.
Findings
Scaling improves downstream task performance
Instruction finetuning enables zero-shot and few-shot learning
Diffusion models show promising reasoning abilities
Abstract
The recent surge of generative AI has been fueled by the generative power of diffusion probabilistic models and the scalable capabilities of large language models. Despite their potential, it remains elusive whether diffusion language models can solve general language tasks comparable to their autoregressive counterparts. This paper demonstrates that scaling diffusion models w.r.t. data, sizes, and tasks can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data via masked language modeling pretraining thanks to their intrinsic connections. We then reprogram pretrained masked language models into diffusion language models via diffusive adaptation, wherein task-specific finetuning and instruction finetuning are explored to unlock their versatility in solving general language tasks. Experiments…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
MethodsDiffusion
