d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation
Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang

TL;DR
d3LLM introduces a novel training and inference approach for diffusion-based large language models, balancing accuracy and parallelism to enable faster decoding without significant performance loss.
Contribution
The paper proposes pseudo-trajectory distillation for training and entropy-based multi-block decoding for inference, with the aim of keeping diffusion LLMs highly parallel without sacrificing accuracy.
Findings
Up to 10x speedup over vanilla LLaDA/Dream.
5x faster than autoregressive models with minimal accuracy loss.
Introduces AUP metric for joint accuracy and parallelism evaluation.
Abstract
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To…
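The entropy-based selection idea in step (ii) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold of 0.5 nats, and the fallback of always decoding the single lowest-entropy token are assumptions made for illustration, and the multi-block scheduling and KV-cache refresh are omitted entirely.

```python
import numpy as np


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution at each position."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def select_tokens_to_unmask(logits, masked, threshold=0.5):
    """Pick the masked positions to decode in one parallel step.

    logits:    (seq_len, vocab) array of model outputs (hypothetical input).
    masked:    boolean (seq_len,) array, True where a token is still masked.
    threshold: assumed entropy cut-off; positions below it are decoded together.
    Returns (positions, token_ids).
    """
    ent = token_entropy(logits)
    ent = np.where(masked, ent, np.inf)  # already-decoded positions never qualify
    chosen = np.flatnonzero(ent < threshold)
    if chosen.size == 0 and masked.any():
        # Assumed fallback: decode the most confident token so each step progresses.
        chosen = np.array([int(np.argmin(ent))])
    return chosen, logits[chosen].argmax(axis=-1)
```

A lower threshold decodes fewer tokens per step (more steps, typically safer), while a higher threshold decodes more tokens at once; this is one simple way to picture the accuracy-parallelism trade-off the abstract describes.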
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
