PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

TL;DR
PSD is a training-free framework that enhances diffusion LLM inference efficiency by adaptively unmasking tokens and collapsing denoising steps, significantly reducing inference costs while maintaining quality.
Contribution
The paper introduces Parallel Speculative Decoding (PSD), a training-free method that combines spatial and temporal inference improvements.
Findings
Achieves up to 5.5x tokens per forward pass.
Maintains accuracy comparable to greedy decoding.
Effective across reasoning and code generation tasks.
Abstract
Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions.…
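The draft-and-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `build_nested_drafts`, `hierarchical_accept`, and `toy_verify`, the fixed list of confidence thresholds, and the fallback to the shallowest draft are all assumptions made for illustration. The sketch assumes that deeper drafts are formed by lowering a confidence cutoff over the predictions of one forward pass, and that a draft counts as "consistent" when the verifier re-predicts every token the draft filled in.

```python
from typing import Callable, List, Optional

MASK = None  # hypothetical marker for a position that is still masked

Seq = List[Optional[str]]


def build_nested_drafts(seq: Seq, tokens: List[str], conf: List[float],
                        thresholds: List[float]) -> List[Seq]:
    """Build one draft per threshold from a single set of predictions.

    tokens[i] / conf[i] are the predicted token and its confidence at
    position i (assumed to come from one forward pass). Thresholds are
    given in descending order, so later drafts unmask more positions.
    """
    masked = [i for i, t in enumerate(seq) if t is MASK]
    if not masked:
        return [list(seq)]
    best = max(masked, key=lambda i: conf[i])
    drafts = []
    for tau in thresholds:
        # Always unmask at least the most confident position (assumption).
        chosen = [i for i in masked if conf[i] >= tau] or [best]
        draft = list(seq)
        for i in chosen:
            draft[i] = tokens[i]
        drafts.append(draft)
    return drafts


def hierarchical_accept(seq: Seq, drafts: List[Seq],
                        verify: Callable[[Seq], List[str]]) -> Seq:
    """Return the deepest draft whose filled tokens the verifier reproduces.

    verify(draft) stands in for one row of a batched verification pass:
    it returns a predicted token for every position given the draft.
    """
    for draft in reversed(drafts):  # try the deepest draft first
        filled = [i for i, t in enumerate(seq)
                  if t is MASK and draft[i] is not MASK]
        pred = verify(draft)
        if all(pred[i] == draft[i] for i in filled):
            return draft
    return drafts[0]  # assumed fallback: the most conservative draft


# Toy run: the verifier disagrees with the draft only at position 3.
seq = ["a", MASK, MASK, MASK]
tokens = ["a", "b", "c", "d"]
conf = [1.0, 0.9, 0.6, 0.3]
drafts = build_nested_drafts(seq, tokens, conf, thresholds=[0.8, 0.5, 0.2])


def toy_verify(draft: Seq) -> List[str]:
    return ["a", "b", "c", "x"]


result = hierarchical_accept(seq, drafts, toy_verify)
print(result)  # ['a', 'b', 'c', None]
```

In this toy run the deepest draft is rejected because the verifier predicts "x" where the draft placed "d", so the two-token draft is kept and one position stays masked for a later step. In a real system the per-draft `verify` calls would presumably be a single batched model call, which is what would let several denoising steps collapse into one verification pass.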
