Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving
Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos

TL;DR
This paper introduces dLLM-Serve, a serving system for diffusion LLMs that targets their memory and performance challenges, with gains reported across diverse hardware, including up to 1.81× higher throughput on RTX 4090 and nearly 4× lower tail latency under contention.
Contribution
We present dLLM-Serve, a holistic serving framework with novel techniques to optimize memory, scheduling, and sparsity for diffusion LLMs in production environments.
Findings
Up to 1.81× throughput improvement on RTX 4090
Nearly 4× reduction in tail latency under contention
First scalable blueprint for diffusion LLM inference
Abstract
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous…
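To make the "monolithic logit tensor" concern concrete, the following is a back-of-the-envelope sketch, not the paper's method. It assumes that each denoising step projects every position to the vocabulary, so the logit tensor has shape [num_tokens, vocab_size], and it compares that peak with a hypothetical variant that projects only `chunk_tokens` positions at a time. The function names, the workload numbers, and the chunking strategy are all illustrative assumptions; the abstract does not specify how Logit-Aware Activation Budgeting decomposes the peak.

```python
def logit_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold a [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_elem


def peak_logit_bytes(num_tokens: int, vocab_size: int, chunk_tokens: int,
                     bytes_per_elem: int = 2) -> int:
    """Peak logit bytes when at most chunk_tokens positions are projected at once."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return logit_bytes(min(num_tokens, chunk_tokens), vocab_size, bytes_per_elem)


# Hypothetical workload (assumption, not from the paper):
# 8 sequences of 1024 positions, a 128,000-entry vocabulary, 2-byte (fp16) logits.
tokens = 8 * 1024
vocab = 128_000

monolithic = logit_bytes(tokens, vocab)
chunked = peak_logit_bytes(tokens, vocab, chunk_tokens=512)

print(f"monolithic peak: {monolithic / 2**30:.2f} GiB")
print(f"chunked peak:    {chunked / 2**30:.3f} GiB")
print(f"reduction:       {monolithic // chunked}x")
```

Under these assumptions the transient peak scales with the number of positions projected at once rather than with the full batch, which is the kind of headroom a serving system could reassign to other requests or caches.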
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices
