Gumbel Distillation for Parallel Text Generation
Chi Zhang, Xixi Hu, Bo Liu, Qiang Liu

TL;DR
Gumbel Distillation is a new technique that improves parallel language models by enabling them to better learn complex token distributions, significantly enhancing their generation quality without sacrificing decoding speed.
Contribution
We introduce Gumbel Distillation, a model-agnostic method that leverages the Gumbel-Max trick to improve parallel decoders' ability to model joint token distributions.
Findings
30% improvement in MAUVE score on OpenWebText
10.5% reduction in generative perplexity
Effective across diverse parallel decoding architectures
Abstract
The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0%…
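The Gumbel-Max trick named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_teacher_logits` is a hypothetical stand-in for an AR teacher, and the vocabulary size and sequence length are arbitrary. It shows only that, once the Gumbel noise is fixed, the sampled token sequence is a deterministic function of that noise, which is the property the abstract says the method relies on.

```python
import math
import random


def sample_gumbel(rng, n):
    """Draw n i.i.d. standard Gumbel(0, 1) variates as -log(-log(U))."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u <= 0.0:  # guard: log(0) is undefined
            u = rng.random()
        out.append(-math.log(-math.log(u)))
    return out


def gumbel_max_token(logits, gumbel_noise):
    """Gumbel-Max trick: argmax_i (logit_i + g_i) is a softmax(logits) sample."""
    scores = [l + g for l, g in zip(logits, gumbel_noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_teacher_logits(prefix, vocab_size=4):
    """Hypothetical AR teacher (assumption): favors (last token + 1) mod V."""
    last = prefix[-1] if prefix else 0
    return [2.0 if i == (last + 1) % vocab_size else 0.0
            for i in range(vocab_size)]


def decode_with_noise(noise, vocab_size=4):
    """Sequential teacher decoding with one fixed noise vector per position."""
    tokens = []
    for g in noise:
        logits = toy_teacher_logits(tokens, vocab_size)
        tokens.append(gumbel_max_token(logits, g))
    return tokens


rng = random.Random(0)
noise = [sample_gumbel(rng, 4) for _ in range(6)]  # latent Gumbel noise
tokens = decode_with_noise(noise)                  # teacher output for it
```

Each `(noise, tokens)` pair produced this way is a deterministic input-output example; re-running `decode_with_noise` on the same `noise` returns the same `tokens`. How a parallel student consumes such pairs is specified in the paper, not in this sketch.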
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper directly addresses the long-standing trade-off in language model decoding: parallel decoders sacrifice generation quality (due to poor joint token distribution modeling) for speed, while autoregressive (AR) models excel at quality but are slow. 2. The paper introduces a knowledge distillation mechanism to transfer the AR teacher’s sequential dependency knowledge to parallel students. This focus on "fixing the joint distribution defect" aligns with the most critical unmet need in parallel d
1. A notable omission is the lack of targeted discussion on the classical "AR-supervised NAR distillation" baseline, a well-established method in non-autoregressive decoding where NAR models are trained to mimic the sampled outputs of AR teachers via cross-entropy (CE) loss or sequence-level losses. This gap may lead readers to question whether the authors have overlooked a foundational approach in the field. 2. While the Gumbel-Max trick and knowledge distillation are individually well-establis
1. The paper correctly identifies that the main challenge of parallel decoding lies in learning the joint token dependencies. By introducing Gumbel noise as an explicit conditioning variable, the proposed approach offers a conceptually clean way to transfer dependency structure from AR to parallel models. 2. The idea of externalizing stochasticity to simplify the learning problem is an important insight likely to inspire future work in distillation and non-AR training. 3. The method is designed
1. Limited and marginal empirical gains. In particular, Figure 3 shows that the NFE–quality trade-off is not clearly improved—on one dataset the curve even slightly underperforms the baseline, and on the others the advantages are only marginal. The improvements on reasoning and QA benchmarks (e.g., BoolQ, ARC) are also small. 2. The paper does not compare with few-step DLM acceleration baselines such as APD (Adaptive Parallel Decoding, arXiv:2506.00413). 3. Although the approach simplifies distr
The paper is very well written, and it is clear that the authors have put a lot of time and care into the presentation. This extends to figures, tables and the appendix as well. The method is (to at least my understanding) sound, and the explanation is pedagogical with consistent notation. The empirical results are persuasive in that they are both consistent and strong. The tasks selected for evaluation seem relevant, and give a fairly comprehensive overview of the expected gains from the gumb
The authors currently don’t demonstrate any results or arguments for the scalability of this approach. Considering that a major component of how well LLMs work is their ability to scale, such an addition would further strengthen this paper. The only part relevant to this seems to be the brief discussion regarding the scalability of vocabulary size. Personally, I’d prefer to see an experiment where the number of model parameters is scaled in the main paper, rather than the existing ablation stud
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
