Promises, Outlooks and Challenges of Diffusion Language Modeling
Justin Deschenaux, Caglar Gulcehre

TL;DR
This paper evaluates the Score Entropy Discrete Diffusion (SEDD) approach as an alternative to autoregressive language models, highlighting its comparable performance, improved inference efficiency, and current limitations in conditional generation.
Contribution
It provides an empirical assessment of SEDD, demonstrating its potential advantages over autoregressive models and identifying areas for improvement.
Findings
SEDD matches autoregressive models in perplexity and benchmark tasks.
SEDD can be up to 4.5 times more efficient in inference than GPT-2.
SEDD is slightly weaker than GPT-2 in conditional generation with short prompts.
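As a reminder of what the perplexity comparison measures, here is a minimal sketch assuming per-token natural-log probabilities are already available from some model. The function name `perplexity` and the sample values are hypothetical and are not taken from the paper or its code.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative log-likelihood per token.

    `token_log_probs` holds natural-log probabilities, one per token.
    (Hypothetical helper for illustration only.)
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative input: a model that assigns probability 0.25 to every token
# is as uncertain as a uniform choice among 4 options.
uniform_over_4 = [math.log(0.25)] * 8
print(perplexity(uniform_over_4))
```

Lower values mean the model assigns higher probability to the held-out text, so "matching in perplexity" means the two model families reach similar values on the same evaluation data.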
Abstract
Modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks and are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to exposure bias. Diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show that it is a promising alternative to autoregressive generation, although it has some shortcomings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, ARC, or WinoGrande. Additionally, we show that in terms…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Attention Dropout · Weight Decay · Dropout · Adam · Linear Warmup With Cosine Annealing
