LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing

TL;DR
LLaDA2.0 introduces a scalable method for converting pre-trained auto-regressive language models into large discrete diffusion models of up to 100 billion parameters, aimed at efficient, high-performing deployment.
Contribution
The paper presents a novel three-phase, block-level warm-up, stable, decay (WSD) training scheme for converting AR models into diffusion LLMs, enabling scalable conversion that preserves the knowledge of the source model.
Findings
Scaled discrete diffusion language models to 100B total parameters, with performance reported as superior.
Demonstrated effective conversion from AR to diffusion models.
Open-sourced models for practical deployment.
Abstract
This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the design principles of knowledge inheritance, progressive adaptation, and efficiency awareness, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the…
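The three-phase schedule in the abstract can be pictured as a function from training step to block size. The sketch below is only an illustration under stated assumptions: the function names, the step counts, the warm-up sizes, and the final block size are hypothetical placeholders, not values taken from the paper. The second function builds a simplified block-causal attention pattern (bidirectional inside a block, causal across blocks), which is one common way to describe block diffusion and may differ from the authors' actual training mask. It shows why growing the block size moves a model smoothly from AR-like behaviour toward full-sequence diffusion.

```python
def block_size_at(step, warmup_steps, stable_steps, seq_len,
                  warmup_sizes, final_size):
    """Return the block size used at a given training step.

    All arguments are hypothetical knobs for illustration only.
    """
    if step < warmup_steps:
        # Warm-up: walk through progressively larger block sizes,
        # giving each listed size an equal share of the warm-up steps.
        idx = step * len(warmup_sizes) // warmup_steps
        return warmup_sizes[idx]
    if step < warmup_steps + stable_steps:
        # Stable: full-sequence diffusion, i.e. one block spans the sequence.
        return seq_len
    # Decay: revert to a compact block size.
    return final_size


def block_attention_mask(seq_len, block_size):
    """Simplified block-causal mask: position i may attend to position j
    when j's block is not later than i's block."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Placeholder schedule: 100 warm-up steps, 1000 stable steps.
    for step in (0, 50, 99, 100, 1100):
        size = block_size_at(step, 100, 1000, 4096, (1, 4, 32, 64), 32)
        print(step, size)
```

With a block size of 1 the mask reduces to the ordinary causal mask of an AR model, and with a block size equal to the sequence length every position can attend to every other position, which matches the full-sequence setting of the stable phase.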
Code & Models
- 🤗 inclusionAI/LLaDA2.0-mini-preview · model · 263 downloads · ♡ 90
- 🤗 inclusionAI/LLaDA2.0-flash-preview · model · 27 downloads · ♡ 68
- 🤗 inclusionAI/LLaDA2.0-mini · model · 70k downloads · ♡ 61
- 🤗 inclusionAI/LLaDA2.0-flash · model · 1.1k downloads · ♡ 68
- 🤗 inclusionAI/LLaDA2.0-flash-CAP · model · 7 downloads · ♡ 10
- 🤗 inclusionAI/LLaDA2.0-mini-CAP · model · 2.6k downloads · ♡ 10
- 🤗 temsa/IrishCore-DiffMask-135M-v1-rc1 · model · 466 downloads
- 🤗 temsa/IrishCore-DiffMask-135M-v1-rc2 · model · 406 downloads
- 🤗 temsa/IrishCore-DiffMask-135M-v1-rc3 · model · 413 downloads
- 🤗 temsa/IrishCore-DiffMask-135M-v1-rc4 · model · 411 downloads
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
