Beyond Autoregression: Fast LLMs via Self-Distillation Through Time
Justin Deschenaux, Caglar Gulcehre

TL;DR
This paper shows that diffusion language models can generate at least 32 tokens simultaneously while surpassing autoregressive models in text quality. A new distillation technique reduces the number of inference steps by a factor of 32 to 64, which makes inference significantly faster.
Contribution
The paper presents a novel distillation method for discrete diffusion language models that lets them outperform autoregressive models in both generation speed and quality.
Findings
Diffusion models generate at least 32 tokens simultaneously.
Diffusion models outperform AR models in text quality and on the LAMBADA natural language understanding benchmark.
Inference speed is up to 8 times faster than KV-cached AR models at 1.3B parameters.
Abstract
Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the…
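The step reduction described above can be illustrated with a small sketch. It is not the authors' implementation: the starting budget of 1024 teacher steps, the halving of steps in each round, the use of a KL divergence as the matching loss, and all function names are assumptions made only for illustration. The sketch shows how repeated rounds of compression could reach a 32x to 64x reduction, and how a student's single-step prediction might be scored against a teacher's multi-step prediction.

```python
import math


def steps_after_rounds(teacher_steps, rounds, steps_per_round=2):
    """Hypothetical schedule: each round compresses `steps_per_round` steps into one."""
    schedule = [teacher_steps]
    for _ in range(rounds):
        schedule.append(max(1, schedule[-1] // steps_per_round))
    return schedule


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Assumed setup: a 1024-step teacher, halved over six rounds.
schedule = steps_after_rounds(1024, rounds=6)

# Toy token distributions (made-up logits over a 3-token vocabulary):
# the teacher after several small steps, the student after one large step.
teacher_probs = softmax([2.0, 0.5, -1.0])
student_probs = softmax([1.5, 0.7, -0.8])
loss = kl_divergence(teacher_probs, student_probs)

print(schedule)
print(f"toy distillation loss: {loss:.4f}")
```

Under these assumptions, five rounds give a 32x reduction and six rounds give a 64x reduction in sampling steps.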
Peer Reviews
Decision: ICLR 2025 Poster
- Improved Generation Quality and Speed: The paper achieves remarkable improvements in both generation quality and efficiency, with fewer decoding steps required. This is a significant advancement for the field.
- Clarity of Writing: The paper is well-written, with clear explanations of the methodologies and results, making it accessible to a wide audience.
- Insufficient Validation on Generation Speed: While generation speed is a key advantage highlighted in the paper, the validation seems inadequate in Section 4.4 and Figure 2b. The experimental settings lack clarity. For instance, it's not clear if the reported 8x speedup is based on 32 steps, 1.3B, and a batch size of 8. Additionally, the paper does not clarify whether the quality (generation perplexity) is comparable between 1.3B DLM and AR models.
- Impact of Batch Size and Model Size: There
The method is simple and intuitive. The results for generative perplexity and number of function evaluations look promising. However, wall-clock time / latency is most important.
The primary weakness is presentation and writing. The abstract claims that the model generates tokens 8x faster than an AR baseline with KV-caching. This must be presented in a figure as early as possible (see comment 10). See comments below for more suggestions.
* The SDTT methodology appears to significantly improve the decoding speed of discrete diffusion models, and the authors provide evidence of this over a large set of experiments. Compressing discrete diffusion sampling to this degree seems like an important contribution that will be used by future text diffusion works.
* The paper conducts a thorough set of ablations justifying key design choices such as the particular divergence measure and the number of steps compressed per SDTT round.
* The
* Performing multiple rounds of SDTT seems integral to the empirical success the paper has but the detail that SDTT is performed over multiple rounds seems to be introduced in the experiments section. It would help the clarity of the paper if this technique was introduced as a core part of the algorithm.
* As mentioned by the authors, previous work has explored distilling multiple diffusion steps in the continuous diffusion setting ([1], [2]). While the technique in SDTT has slight differences
Taxonomy
Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Fuel Cells and Related Materials
Methods: Diffusion
