DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Fuliang Liu; Xue Li; Ketai Zhao; Yinxi Gao; Ziyan Zhou; Zhonghui Zhang; Zhibin Wang; Wanchun Dou; Sheng Zhong; Chen Tian

arXiv:2601.19278·cs.CL·January 28, 2026

Open Access

TL;DR

DART introduces a diffusion-inspired speculative decoding method that predicts multiple future tokens in parallel, significantly accelerating large language model inference while maintaining high accuracy.

Contribution

DART proposes a parallel logit prediction approach and an efficient tree pruning algorithm, reducing drafting latency and outperforming existing methods such as EAGLE3.

Findings

01. Achieves 2.03x to 3.44x speedup in decoding time.

02. Surpasses EAGLE3 by 30% on average in speed.

03. Maintains high draft accuracy with reduced overhead.

Abstract

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference, resulting in high drafting latency and ultimately rendering the drafting stage itself a performance bottleneck. Inspired by diffusion-based large language models (dLLMs), we propose DART, which leverages parallel generation to reduce drafting latency. DART predicts logits for multiple future masked positions in parallel within a single forward pass based on hidden states of the target model, thereby eliminating autoregressive rollouts in the draft model while preserving a lightweight design. Based on these parallel logit predictions, we further introduce an efficient tree pruning algorithm that constructs high-quality draft token trees with…
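
The abstract is truncated before it describes the tree pruning algorithm, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes the draft model has already produced one probability distribution per future position in a single pass (the hypothetical `position_probs` input), treats those positions as independent, and uses a plain beam-style pruning as a stand-in for DART's pruning step. The function names `build_draft_paths` and `accepted_prefix`, and the parameters `top_k` and `max_paths`, are illustrative assumptions.

```python
import heapq
import math


def build_draft_paths(position_probs, top_k=2, max_paths=4):
    """Beam-style pruning over per-position distributions (illustrative only).

    position_probs: list of {token: probability} dicts, one per future
    position, standing in for the softmax of logits predicted in parallel.
    Returns up to max_paths (tokens, log_prob) pairs, best first.
    """
    beams = [((), 0.0)]
    for probs in position_probs:
        # Keep only the top_k most likely tokens at this position.
        top = heapq.nlargest(top_k, probs.items(), key=lambda kv: kv[1])
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in top
            if p > 0.0
        ]
        # Prune to the max_paths highest-scoring partial sequences.
        beams = heapq.nlargest(max_paths, candidates, key=lambda c: c[1])
    return beams


def accepted_prefix(draft, target_greedy):
    """Greedy verification: keep draft tokens up to the first mismatch
    with the tokens the target model would have produced itself."""
    accepted = []
    for d, t in zip(draft, target_greedy):
        if d != t:
            break
        accepted.append(d)
    return accepted


# Toy distributions for three future positions (made-up numbers).
position_probs = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.6, "dog": 0.4},
    {"sat": 0.9, "ran": 0.1},
]
paths = build_draft_paths(position_probs, top_k=2, max_paths=3)
best_tokens = paths[0][0]
kept = accepted_prefix(best_tokens, ["the", "cat", "ran"])
```

In this toy run the verification step keeps the first two tokens of the best path and rejects the third. Because only tokens the target model agrees with are accepted, the output matches what the target model would produce alone, which is the sense in which speculative decoding is lossless.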

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods