FOCUS: DLLMs Know How to Tame Their Compute Bound

Kaihua Liang; Xin Tan; An Zhong; Hong Xu; Marco Canini

arXiv:2601.23278 · cs.LG · February 2, 2026

Open Access

TL;DR

FOCUS is an inference system for diffusion large language models (DLLMs) that focuses computation on the tokens that are decodable at each diffusion step, raising throughput by up to 3.52× over LMDeploy while maintaining or improving generation quality.

Contribution

We introduce FOCUS, a novel inference system that dynamically prioritizes decodable tokens in DLLMs, reducing compute waste and enabling scalable, high-throughput decoding.

Findings

01 Up to 3.52× throughput improvement over LMDeploy

02 Maintains or improves generation quality

03 Effectively scales DLLM decoding performance

Abstract

Diffusion Large Language Models (DLLMs) offer a compelling alternative to autoregressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, so most compute is wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on the fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput…
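
The link the abstract draws between evicting non-decodable tokens and a larger effective batch size can be illustrated with a toy calculation. The sketch below is not the FOCUS implementation: the function names, the importance scores, the fixed `keep` count, and the per-step token budget are all hypothetical, and top-k selection by importance is only a stand-in for whatever selection rule the paper actually uses.

```python
from typing import List


def select_focus_tokens(importance: List[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring positions in a block.

    `importance` is a hypothetical per-token score (a stand-in for the
    attention-derived importance mentioned in the abstract). Indices are
    returned in their original order within the block.
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:keep])


def sequences_per_step(token_budget: int, tokens_per_sequence: int) -> int:
    """How many sequences fit in one forward pass under a fixed token budget."""
    return token_budget // tokens_per_sequence


# Hypothetical scores for one 8-token block; only a few positions stand out.
importance = [0.02, 0.31, 0.05, 0.27, 0.01, 0.04, 0.22, 0.08]

# Assumption: keep the 3 most important positions and evict the rest.
kept = select_focus_tokens(importance, keep=3)

# Assumption: the accelerator can process 256 tokens per step.
full = sequences_per_step(token_budget=256, tokens_per_sequence=len(importance))
focused = sequences_per_step(token_budget=256, tokens_per_sequence=len(kept))

print("kept positions:", kept)
print("sequences per step, full block:", full)
print("sequences per step, focused:", focused)
```

Under these made-up numbers, computing over 3 retained positions instead of all 8 lets more sequences share the same token budget, which is the sense in which eviction raises the effective batch size. Real throughput gains depend on details the abstract does not give.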

Taxonomy

Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning