Large Language Diffusion Models

Shen Nie; Fengqi Zhu; Zebin You; Xiaolu Zhang; Jingyang Ou; Jun Hu; Jun Zhou; Yankai Lin; Ji-Rong Wen; Chongxuan Li

arXiv:2502.09992 · cs.CL · October 21, 2025 · 5 cites

TL;DR

This paper introduces LLaDA, a diffusion-based language model that challenges the dominance of autoregressive models, demonstrating competitive performance and strong instruction-following abilities at scale.

Contribution

LLaDA is a diffusion model for language, trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm, that achieves results comparable to autoregressive large language models.

Findings

01  LLaDA performs comparably to the authors' self-constructed autoregressive (ARM) baselines across benchmarks covering general tasks, math, and code.

02  LLaDA 8B is competitive with LLaMA3 8B in in-context learning.

03  LLaDA surpasses GPT-4o in a reversal poem completion task.

Abstract

The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover,…
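The forward masking process and the masked-token prediction objective mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each token is masked independently with probability t, and that the training term sums the negative log-probability of the true token over masked positions and scales it by 1/t. The names `MASK`, `forward_mask`, `masked_loss`, and `uniform_predictor` are hypothetical, and a uniform guesser stands in for the Transformer.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def forward_mask(tokens, t, rng):
    """Forward process (assumed form): mask each token independently with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_loss(tokens, noisy, predict_proba, t):
    """One Monte Carlo term of an assumed bound.

    Sums -log p(true token) over masked positions only, then scales by 1/t.
    Requires 0 < t <= 1.
    """
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK:
            total += -math.log(predict_proba(noisy, i, clean))
    return total / t


def uniform_predictor(vocab_size):
    """Stand-in for the mask predictor: assigns equal probability to every token."""
    def predict_proba(noisy, position, candidate):
        return 1.0 / vocab_size
    return predict_proba


rng = random.Random(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
t = 0.5  # masking ratio; during training this would be sampled per sequence
noisy = forward_mask(tokens, t, rng)
loss = masked_loss(tokens, noisy, uniform_predictor(vocab_size=8), t)
print(noisy, loss)
```

In this sketch, unmasked positions contribute nothing to the loss, so the model is only scored on tokens it cannot see. Generation would run the process in reverse, starting from a fully masked sequence and filling in predicted tokens over several steps.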

Code & Models

Videos

Large Language Diffusion Models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Absolute Position Encodings · Dense Connections · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Attention Is All You Need · Multi-Head Attention · Diffusion