Taming the Memory Footprint Crisis: System Design for Production Diffusion LLM Serving

Jiakun Fan; Yanglin Zhang; Xiangchen Li; Dimitrios S. Nikolopoulos

arXiv:2512.17077 · cs.DC · January 16, 2026

TL;DR

This paper introduces dLLM-Serve, a serving system for diffusion LLMs that addresses their memory and performance challenges, with gains reported across diverse hardware, including up to 1.81× higher throughput on an RTX 4090 and a nearly 4× reduction in tail latency under contention.

Contribution

We present dLLM-Serve, a holistic serving framework with novel techniques to optimize memory, scheduling, and sparsity for diffusion LLMs in production environments.

Findings

01. Up to 1.81× throughput improvement on RTX 4090
02. Nearly 4× reduction in tail latency under contention
03. First scalable blueprint for diffusion LLM inference

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to Autoregressive Models (ARMs), utilizing parallel decoding to overcome sequential bottlenecks. However, existing research focuses primarily on kernel-level optimizations, lacking a holistic serving framework that addresses the unique memory dynamics of diffusion processes in production. We identify a critical "memory footprint crisis" specific to dLLMs, driven by monolithic logit tensors and the severe resource oscillation between compute-bound "Refresh" phases and bandwidth-bound "Reuse" phases. To bridge this gap, we present dLLM-Serve, an efficient dLLM serving system that co-optimizes memory footprint, computational scheduling, and generation quality. dLLM-Serve introduces Logit-Aware Activation Budgeting to decompose transient tensor peaks, a Phase-Multiplexed Scheduler to interleave heterogeneous…
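To give a rough sense of why a monolithic logit tensor can dominate transient memory, the following back-of-the-envelope sketch compares materializing logits for every token at once against keeping only a chunk of rows live at a time. It is an illustration only, not the paper's Logit-Aware Activation Budgeting: the function names, the workload numbers (batch size, sequence length, vocabulary size, chunk size), and the 2-byte element width are all assumptions chosen for the example.

```python
def logit_bytes(num_tokens: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold a [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_elem


def chunked_peak_bytes(num_tokens: int, vocab_size: int,
                       chunk_tokens: int, bytes_per_elem: int = 2) -> int:
    """Peak logit bytes if at most chunk_tokens rows are live at a time."""
    live_rows = min(num_tokens, chunk_tokens)
    return logit_bytes(live_rows, vocab_size, bytes_per_elem)


# Hypothetical workload (assumed values, not taken from the paper).
batch, seq_len, vocab = 8, 2048, 128_000
tokens = batch * seq_len

# Assumption: logits are produced for every position in a single step.
full = logit_bytes(tokens, vocab)
chunked = chunked_peak_bytes(tokens, vocab, chunk_tokens=1024)

print(f"monolithic logits: {full / 2**30:.2f} GiB")
print(f"chunked peak:      {chunked / 2**30:.2f} GiB")
print(f"peak reduction:    {full // chunked}x")
```

Under these assumed numbers the peak shrinks in proportion to the chunk size, at the cost of processing the rows in several passes; how the actual system decomposes the peak is described in the paper itself.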

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Ferroelectric and Negative Capacitance Devices