Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

Lingkun Long; Yushi Huang; Shihao Bai; Ruihao Gong; Jun Zhang; Ao Zhou; Jianlei Yang

arXiv:2602.02159 · cs.CL · February 3, 2026

TL;DR

Focus-dLLM introduces a confidence-guided, training-free attention sparsification method that significantly accelerates long-context diffusion LLM inference without loss of accuracy, achieving over 29x speedup at 32K context length.

Contribution

The paper presents a training-free attention sparsification framework for diffusion LLMs. It combines a past confidence-guided indicator, which predicts the regions about to be unmasked, with sink-aware pruning for efficient inference.

Findings

01. Achieves over 29x speedup at 32K context length.

02. Maintains lossless inference accuracy.

03. Reuses sink locations across layers for efficiency.

Abstract

Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove…
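The first step described in the abstract, using past confidence to predict which masked positions are likely to be decoded next, can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `predict_unmasked_region`, the top-k selection rule, and the `k` and `window` parameters are all assumptions made for illustration; the abstract only states that confidence correlates strongly across adjacent steps. The sink-aware pruning step is not sketched because its description is truncated above.

```python
def predict_unmasked_region(prev_confidence, is_masked, k, window=0):
    """Guess which masked positions will be decoded next (illustrative only).

    prev_confidence: per-position confidence from the previous diffusion step.
    is_masked:       True where the position is still masked.
    k:               number of candidate positions to select (hypothetical).
    window:          neighbours on each side to include (hypothetical).
    """
    if len(prev_confidence) != len(is_masked):
        raise ValueError("prev_confidence and is_masked must have equal length")

    masked = [i for i, m in enumerate(is_masked) if m]
    # Assumption: positions that were confident at the previous step are the
    # ones most likely to be unmasked at the current step.
    ranked = sorted(masked, key=lambda i: prev_confidence[i], reverse=True)

    focus = set()
    for i in ranked[:k]:
        lo = max(0, i - window)
        hi = min(len(is_masked) - 1, i + window)
        for j in range(lo, hi + 1):
            if is_masked[j]:  # only still-masked positions can be decoded
                focus.add(j)
    return sorted(focus)


prev_confidence = [0.99, 0.10, 0.80, 0.75, 0.20, 0.05]
is_masked = [False, True, True, True, True, True]
focus = predict_unmasked_region(prev_confidence, is_masked, k=2, window=1)
print(focus)  # [1, 2, 3, 4]
```

In a sparsified attention pass, the returned positions would serve as the queries whose attention importance needs to be estimated, so that less relevant context can be pruned for the remaining computation.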

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis