FOCUS: DLLMs Know How to Tame Their Compute Bound
Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

TL;DR
FOCUS is a system that improves the efficiency of diffusion large language models by dynamically focusing on decodable tokens, significantly increasing throughput while maintaining quality.
Contribution
We introduce FOCUS, a novel inference system that dynamically prioritizes decodable tokens in DLLMs, reducing compute waste and enabling scalable, high-throughput decoding.
Findings
Up to 3.52× throughput improvement over LMDeploy
Maintains or improves generation quality
Effectively scales DLLM decoding performance
Abstract
Diffusion Large Language Models (DLLMs) offer a compelling alternative to auto-regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, so most compute is wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on the fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52× throughput…
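The idea of keeping likely-decodable tokens and evicting the rest can be illustrated with a toy sketch. This is not the paper's algorithm: the top-k selection rule, the importance scores, the `keep` parameter, and the fixed per-step token budget are all assumptions made for illustration, and the function names are hypothetical. The sketch only shows why dropping low-importance positions lets more sequences share the same compute budget.

```python
from typing import List, Tuple


def select_focus_tokens(importance: List[float], keep: int) -> Tuple[List[int], List[int]]:
    """Split a block's token positions into kept and evicted sets.

    importance: hypothetical per-token scores (for example, attention mass
                received by each position); assumed, not taken from the paper.
    keep:       number of positions to keep for the rest of the step.
    """
    if not 0 <= keep <= len(importance):
        raise ValueError("keep must be between 0 and the block size")
    # Rank positions by score, highest first; ties are broken by position.
    ranked = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    kept = sorted(ranked[:keep])
    evicted = sorted(ranked[keep:])
    return kept, evicted


def sequences_per_step(token_budget: int, tokens_per_sequence: int) -> int:
    """How many sequences fit in one forward pass under a fixed token budget."""
    return token_budget // tokens_per_sequence


# Toy block of 8 masked positions with made-up importance scores.
scores = [0.02, 0.31, 0.05, 0.27, 0.01, 0.04, 0.22, 0.08]
kept, evicted = select_focus_tokens(scores, keep=3)
print(kept, evicted)  # [1, 3, 6] [0, 2, 4, 5, 7]

# With an assumed budget of 256 token positions per step:
print(sequences_per_step(256, len(scores)))  # 32 sequences with full blocks
print(sequences_per_step(256, len(kept)))    # 85 sequences after eviction
```

In this toy setting, processing 3 positions per sequence instead of 8 raises the number of sequences per step from 32 to 85. The numbers are illustrative only and say nothing about the speedups reported for FOCUS.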
Taxonomy
Topics: Topic Modeling · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
