DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Younjoo Lee; Junghoo Lee; Seungkyun Dan; Jaiyoung Park; Jung Ho Ahn

arXiv:2603.08026 · cs.CL · March 10, 2026

Open Access

TL;DR

DyLLM accelerates diffusion language model inference by recomputing only salient tokens at each denoising step, reaching up to 9.6x higher throughput while largely preserving baseline accuracy.

Contribution

DyLLM is a training-free inference framework that leverages temporal sparsity to selectively compute only salient tokens, reducing computational cost in diffusion LLMs.

Findings

1. Achieves up to 9.6x higher throughput on various benchmarks.
2. Largely preserves baseline accuracy of state-of-the-art models.
3. Effective across reasoning and code-generation tasks.

Abstract

Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the…
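The selection rule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `select_salient`, `partial_update`), the threshold value of 0.99, and the use of plain Python lists as stand-ins for per-token attention contexts and cached activations are all assumptions made for illustration. It only shows the general idea of flagging tokens whose attention context changed between adjacent denoising steps and recomputing just those positions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "changed" so it gets recomputed
    return dot / (norm_a * norm_b)

def select_salient(prev_ctx, curr_ctx, threshold=0.99):
    """Return indices of tokens whose context changed between two steps.

    `threshold` is a hypothetical value chosen for this toy example.
    """
    return [
        i
        for i, (p, c) in enumerate(zip(prev_ctx, curr_ctx))
        if cosine_similarity(p, c) < threshold
    ]

def partial_update(cache, salient, recompute):
    """Recompute only salient positions; reuse cached values elsewhere."""
    out = list(cache)
    for i in salient:
        out[i] = recompute(i)
    return out

# Toy attention contexts for three tokens at steps t-1 and t.
prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_ctx = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.01]]

salient = select_salient(prev_ctx, curr_ctx)

# Stand-in for cached activations and for the expensive per-token recompute.
cache = ["h0", "h1", "h2"]
updated = partial_update(cache, salient, lambda i: f"h{i}_new")

print(salient)
print(updated)
```

In this toy input only the second token's context rotates substantially, so it is the only position recomputed; the other two reuse their cached values. How a real system picks the threshold, and at which layers it compares contexts, is not specified in the text above.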

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis