Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models
Etrit Haxholli, Yeti Z. Gurbuz, Ogul Can, Eli Waxman

TL;DR
This paper improves discrete diffusion language models by introducing new theorems for KL divergence, deriving a better perplexity bound, and demonstrating that ratio-matching with denoising cross-entropy outperforms previous methods in perplexity and training speed.
Contribution
The paper presents new theoretical bounds, an improved perplexity measure, and a novel CTMC transition-rate matrix for more efficient and accurate discrete diffusion language modeling.
Findings
Ratio-matching with denoising cross-entropy yields up to 10% lower perplexity than training with score-entropy.
The proposed CTMC transition-rate matrix enables faster training and better prediction refinement.
A derived analytic expression for the matrix exponential facilitates efficient computation.
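To illustrate why an analytic matrix exponential matters, here is a minimal sketch using the standard uniform transition-rate matrix Q = (1/n)·11ᵀ − I as a stand-in. This is an assumption made for illustration; it is not the paper's roulette matrix, whose closed form is not given in this summary. Because Q·Q = −Q for this choice, exp(tQ) = e^{-t}·I + (1 − e^{-t})·(1/n)·11ᵀ, which costs O(n²) to write down and needs no series or eigendecomposition. All function names below are hypothetical.

```python
import math


def uniform_rate_matrix(n):
    """Uniform CTMC rate matrix Q = (1/n) * ones - I (each row sums to zero)."""
    return [[1.0 / n - (1.0 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def expm_uniform_closed_form(n, t):
    """Closed form exp(tQ) = e^{-t} I + (1 - e^{-t}) (1/n) ones for the uniform Q."""
    decay = math.exp(-t)
    return [[(1.0 - decay) / n + (decay if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, used only by the reference computation."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm_taylor(q, t, terms=40):
    """Reference value: truncated power series sum_k (tQ)^k / k!."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds (tQ)^k / k!
    tq = [[t * x for x in row] for row in q]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, tq)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def max_abs_diff(a, b):
    """Largest entrywise gap between two matrices of equal shape."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    n, t = 5, 0.7
    closed = expm_uniform_closed_form(n, t)
    series = expm_taylor(uniform_rate_matrix(n), t)
    print("max |closed - series| =", max_abs_diff(closed, series))
```

The closed form is what makes the forward transition probabilities cheap at language-model vocabulary sizes, where a generic matrix exponential would be impractical.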
Abstract
While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound of the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy with up to 10% lower…
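As a hedged illustration of how an upper bound on negative log-likelihood becomes an upper bound on perplexity: perplexity is the exponential of the per-token negative log-likelihood, and the exponential is monotone, so any bound on the former carries over. The sketch below assumes the bound is given in nats and summed over a sequence; the function name and the numbers are hypothetical and are not taken from the paper.

```python
import math


def perplexity_from_nll(total_nll_nats, num_tokens):
    """Map a total negative log-likelihood (or an upper bound on it), in nats,
    to the corresponding perplexity (or perplexity upper bound)."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(total_nll_nats / num_tokens)


if __name__ == "__main__":
    # Hypothetical numbers: a tighter NLL bound gives a tighter perplexity bound.
    loose_bound = perplexity_from_nll(3500.0, 1024)
    tight_bound = perplexity_from_nll(3400.0, 1024)
    print(loose_bound, tight_bound)
```

Because the mapping is monotone, tightening the KL-based bound on the likelihood directly tightens the reported perplexity bound without retraining the model.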
Peer Reviews
Decision: ICLR 2025 Poster
The paper is well-written and clear. Developing an efficient perplexity bound and training algorithm for discrete diffusion language models is an important yet challenging problem. The particular forms of the bounds, the roulette diffusion matrix, and the denoising cross-entropy loss are interesting and new. Both theoretical development and empirical evaluations are conducted.
From Table 2, overall GPT-2 performs best over SEDDs, CEDD*, and CEDDT. The practical usefulness of the new bound and training algorithm is not completely clear.
The novelty of this paper consists of several theoretical findings, which are undoubtedly its main strength. These findings include a novel type of transition matrix for discrete diffusion models along with all the facts necessary to use it in practice. The feasibility of using this type of forward diffusion was shown through experiments. Second, a score reparameterization was used to derive a CEDD model competitive with a common SEDD model. Lastly, the novel bound
1. I am not sure about the correctness of the perplexity evaluation in Section 4.1. Evaluating likelihood (which is equivalent to perplexity evaluation) with an imperfect model may lead to unreliable results, as you mention in line 362. I would suggest using better and more modern language models instead of GPT-2 and reporting an aggregated score. 2. CEDD* differs from CEDD by some hand-crafted time-dependent weights (line 324) and the results across different transition matrix types differ a
The paper is well written and introduces, to the best of my knowledge, a novel set of theorems.
* My primary concern is that the authors appear to utilize the official implementation of SEDD, which has been demonstrated to have a flaw in the Gumbel-max trick used for sampling. In [1], the authors show that sampling in low precision from a high-dimensional distribution leads to an effect similar to sampling with temperature. This is highly relevant here, as the exponential in the proposed method may anneal the temperature even more. * Prior Work: The authors only compare their resu
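For readers unfamiliar with the Gumbel-max trick mentioned in this review, here is a minimal full-precision sketch: adding independent Gumbel(0, 1) noise to each logit and taking the argmax draws from the softmax distribution. This is not the SEDD implementation and does not reproduce the low-precision effect the reviewer describes; the function names and the example logits are assumptions chosen for illustration.

```python
import math
import random


def gumbel_max_sample(logits, rng):
    """Draw one index from softmax(logits) via argmax(logit + Gumbel noise)."""
    best_index, best_value = 0, -math.inf
    for i, logit in enumerate(logits):
        # rng.random() lies in [0, 1); clamp away from 0 so log() is defined.
        u = max(rng.random(), 1e-300)
        gumbel = -math.log(-math.log(u))
        if logit + gumbel > best_value:
            best_index, best_value = i, logit + gumbel
    return best_index


def empirical_frequencies(logits, num_samples, seed=0):
    """Fraction of draws landing on each index, for comparison with softmax."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, rng)] += 1
    return [c / num_samples for c in counts]


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]                      # hypothetical distribution
    logits = [math.log(p) for p in target]
    print(empirical_frequencies(logits, 20000))   # should be close to target
```

In double precision the empirical frequencies track the softmax probabilities; the concern raised in the review is about what happens when the noise is generated in reduced precision over a very large vocabulary.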
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
