Beyond Autoregression: Fast LLMs via Self-Distillation Through Time

Justin Deschenaux; Caglar Gulcehre

arXiv:2410.21035·cs.LG·February 10, 2025

TL;DR

This paper shows that diffusion language models can generate at least 32 tokens simultaneously while exceeding autoregressive models in text quality. A new distillation method for discrete diffusion models cuts the number of inference steps by a factor of 32-64, which makes inference significantly faster.

Contribution

The paper presents a novel distillation method for discrete diffusion language models. The distilled models exceed autoregressive models in text quality and generate text faster.

Findings

01

Diffusion models generate at least 32 tokens simultaneously.

02

Diffusion models outperform AR models in text quality and benchmark performance.

03

Inference speed is up to 8 times faster than KV-cached AR models at 1.3B parameters.

Abstract

Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advances have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the…
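The 32-64 times step reduction quoted in the TL;DR can be related to a number of distillation rounds with simple arithmetic. The sketch below is illustrative only: it assumes each round divides the number of sampling steps by a fixed factor (2 here), and the starting budget of 1024 steps is a hypothetical value, not a figure taken from this page.

```python
def rounds_needed(reduction_factor: int, per_round_factor: int = 2) -> int:
    """Smallest number of rounds whose combined reduction reaches reduction_factor.

    Assumes every round divides the step count by per_round_factor (assumption).
    """
    rounds, achieved = 0, 1
    while achieved < reduction_factor:
        achieved *= per_round_factor
        rounds += 1
    return rounds


def steps_after(initial_steps: int, rounds: int, per_round_factor: int = 2) -> int:
    """Sampling steps left after the given number of rounds."""
    return initial_steps // per_round_factor ** rounds


# Hypothetical starting budget, chosen only to make the arithmetic concrete.
INITIAL_STEPS = 1024

for factor in (32, 64):
    r = rounds_needed(factor)
    print(f"{factor}x reduction: {r} rounds, {steps_after(INITIAL_STEPS, r)} steps left")
```

Under these assumptions, a 32 times reduction corresponds to five halving rounds and a 64 times reduction to six.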

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- Improved Generation Quality and Speed: The paper achieves remarkable improvements in both generation quality and efficiency, with fewer decoding steps required. This is a significant advancement for the field.
- Clarity of Writing: The paper is well-written, with clear explanations of the methodologies and results, making it accessible to a wide audience.

Weaknesses

- Insufficient Validation on Generation Speed: While generation speed is a key advantage highlighted in the paper, the validation seems inadequate in Section 4.4 and Figure 2b. The experimental settings lack clarity. For instance, it's not clear if the reported 8x speedup is based on 32 steps, 1.3B, and a batch size of 8. Additionally, the paper does not clarify whether the quality (generation perplexity) is comparable between 1.3B DLM and AR models.
- Impact of Batch Size and Model Size: There

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The method is simple and intuitive. The results for generative perplexity and number of function evaluations look promising. However, wall-clock time / latency is most important.

Weaknesses

The primary weakness is presentation and writing. The abstract claims that the model generates tokens 8x faster than an AR baseline with KV-caching. This must be presented in a figure as early as possible (see comment 10). See comments below for more suggestions.

Reviewer 03 · Rating 8 · Confidence 3

Strengths

* The SDTT methodology appears to significantly improve the decoding speed of discrete diffusion models, and the authors provide evidence of this over a large set of experiments. Compressing discrete diffusion sampling to this degree seems like an important contribution that will be used by future text diffusion works.
* The paper conducts a thorough set of ablations justifying key design choices such as the particular divergence measure and the number of steps compressed per SDTT round.
* The

Weaknesses

* Performing multiple rounds of SDTT seems integral to the empirical success the paper has but the detail that SDTT is performed over multiple rounds seems to be introduced in the experiments section. It would help the clarity of the paper if this technique was introduced as a core part of the algorithm.
* As mentioned by the authors, previous work has explored distilling multiple diffusion steps in the continuous diffusion setting ([1], [2]). While the technique in SDTT has slight differences
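
Reviewer 03 refers to ablations over the divergence measure used for distillation. As a generic illustration only (the excerpts shown here do not say which divergences the paper compares), the sketch below computes the forward and reverse KL divergence between a teacher and a student distribution over a three-token vocabulary. The probabilities are made up for the example and the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up token distributions (assumption, for illustration only).
teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

forward_kl = kl_divergence(teacher, student)  # KL(teacher || student)
reverse_kl = kl_divergence(student, teacher)  # KL(student || teacher)

print(f"forward KL: {forward_kl:.4f}")
print(f"reverse KL: {reverse_kl:.4f}")
```

The two directions generally give different values, which is one reason the choice of divergence is worth ablating.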

Code & Models

Repositories

jdeschena/sdtt
pytorch · Official

Models

🤗 jdeschena/sdtt
model · 87 dl · ♡ 3

Taxonomy

Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Fuel Cells and Related Materials

Methods: Diffusion