Gumbel Distillation for Parallel Text Generation

Chi Zhang; Xixi Hu; Bo Liu; Qiang Liu

arXiv:2603.22216·cs.CL·March 24, 2026

Open Access · 3 Reviews

TL;DR

Gumbel Distillation is a new technique that improves parallel language models by enabling them to better learn complex token distributions, significantly enhancing their generation quality without sacrificing decoding speed.

Contribution

We introduce Gumbel Distillation, a model-agnostic method that leverages the Gumbel-Max trick to improve parallel decoders' ability to model joint token distributions.

Findings

1. 30% improvement in MAUVE score on OpenWebText
2. 10.5% reduction in generative perplexity
3. Effective across diverse parallel decoding architectures

Abstract

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0%…
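The Gumbel-Max step described in the abstract can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: `toy_teacher_logits` is a hypothetical stand-in for a real AR teacher, and the other function names are likewise assumptions made for this example. It shows only how a fixed Gumbel noise sequence determines the teacher's tokens; how a parallel student is then trained on such noise-token pairs is not covered here.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate via inverse-CDF sampling."""
    u = rng.random()
    # random() can return exactly 0.0, where log(u) is undefined; redraw.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(logits, noise):
    """Return argmax_i(logits[i] + noise[i]).

    With i.i.d. Gumbel(0, 1) noise this index is distributed as
    softmax(logits); with the noise held fixed it is deterministic.
    """
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


def toy_teacher_logits(prefix, vocab_size):
    """Hypothetical AR teacher: favors the token following the last one."""
    last = prefix[-1] if prefix else 0
    favored = (last + 1) % vocab_size
    return [1.5 if tok == favored else 0.0 for tok in range(vocab_size)]


def make_noise(seq_len, vocab_size, seed):
    """Sample one Gumbel noise vector per sequence position."""
    rng = random.Random(seed)
    return [[sample_gumbel(rng) for _ in range(vocab_size)]
            for _ in range(seq_len)]


def decode_with_noise(noise_seq, vocab_size):
    """Map a fixed noise sequence to teacher tokens, left to right."""
    tokens = []
    for noise in noise_seq:
        logits = toy_teacher_logits(tokens, vocab_size)
        tokens.append(gumbel_max(logits, noise))
    return tokens


if __name__ == "__main__":
    noise = make_noise(seq_len=6, vocab_size=5, seed=0)
    # The same noise always yields the same token sequence.
    print(decode_with_noise(noise, vocab_size=5))
```

Because all randomness lives in `noise`, calling `decode_with_noise` twice with the same noise returns the same tokens, which is the sense in which the mapping from noise to teacher output is deterministic.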

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper directly addresses the long-standing trade-off in language model decoding: parallel decoders sacrifice generation quality (due to poor joint token distribution modeling) for speed, while autoregressive (AR) models excel at quality but are slow.
2. The paper introduces a knowledge distillation mechanism to transfer the AR teacher’s sequential dependency knowledge to parallel students. This focus on "fixing the joint distribution defect" aligns with the most critical unmet need in parallel d

Weaknesses

1. A notable omission is the lack of targeted discussion on the classical "AR-supervised NAR distillation" baseline, a well-established method in non-autoregressive decoding where NAR models are trained to mimic the sampled outputs of AR teachers via cross-entropy (CE) loss or sequence-level losses. This gap may lead readers to question whether the authors have overlooked a foundational approach in the field.
2. While the Gumbel-Max trick and knowledge distillation are individually well-establis

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper correctly identifies that the main challenge of parallel decoding lies in learning the joint token dependencies. By introducing Gumbel noise as an explicit conditioning variable, the proposed approach offers a conceptually clean way to transfer dependency structure from AR to parallel models.
2. The idea of externalizing stochasticity to simplify the learning problem is an important insight likely to inspire future work in distillation and non-AR training.
3. The method is designed

Weaknesses

1. Limited and marginal empirical gains. In particular, Figure 3 shows that the NFE–quality trade-off is not clearly improved: on one dataset the curve even slightly underperforms the baseline, and on the others the advantages are only marginal. The improvements on reasoning and QA benchmarks (e.g., BoolQ, ARC) are also small.
2. The paper does not compare with few-step DLM acceleration baselines such as APD (Adaptive Parallel Decoding, arXiv:2506.00413).
3. Although the approach simplifies distr

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The paper is very well written, and it is clear that the authors have put a lot of time and care into the presentation. This extends to the figures, tables, and appendix as well. The method is (at least to my understanding) sound, and the explanation is pedagogical with consistent notation. The empirical results are persuasive in that they are both consistent and strong. The tasks selected for evaluation seem relevant, and give a fairly comprehensive overview of the expected gains from the gumb

Weaknesses

The authors currently don’t demonstrate any results or arguments for the scalability of this approach. Considering that a major component of how well LLMs work is their ability to scale, such an addition would further strengthen this paper. The only part relevant to this seems to be the brief discussion regarding the scalability of vocabulary size. Personally, I’d prefer to see an experiment where the number of model parameters is scaled in the main paper, rather than the existing ablation stud

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques