Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

Liang Zheng; Bowen Shi; Yitao Hu; Jiawei Zhang; Ruofan Li; Sheng Chen; Wenxin Li; Keqiu Li

arXiv:2601.06562 · cs.LG · January 13, 2026

TL;DR

Mosaic is an inference system for diffusion-based large language models that uses global memory planning and dynamic peak management to cut the memory peak-to-average ratio by 2.71× and support 15.89-32.98× longer sequences on the same hardware.

Contribution

It introduces a global, dynamic memory management approach with a mask-only logits kernel and lazy chunking optimizer, enabling long-context inference for diffusion LLMs without sacrificing speed or accuracy.

Findings

1. Achieves 2.71× reduction in memory peak-to-average ratio.
2. Supports 15.89-32.98× longer sequences on the same hardware.
3. Reduces inference latency by 4.12%-23.26%.

Abstract

Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm.…
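The abstract's point that the transient memory peak can toggle between logits and FFNs can be illustrated with back-of-the-envelope arithmetic. The sketch below is not Mosaic's planner or kernel. It only compares the size of a logits tensor (computed for every position, or for masked positions only) against a single FFN intermediate tensor. All names and dimensions (`seq_len`, `num_masked`, `vocab_size`, `d_ffn`, 2 bytes per element) are hypothetical values chosen for illustration, and a real FFN may hold more than one intermediate tensor at a time.

```python
def tensor_bytes(rows: int, cols: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of a dense rows x cols tensor (2 bytes/elem, e.g. fp16)."""
    return rows * cols * bytes_per_elem


def transient_peaks(seq_len: int, num_masked: int, vocab_size: int,
                    d_ffn: int, bytes_per_elem: int = 2) -> dict:
    """Rough sizes of three transient tensors for one denoising step.

    Assumption: one FFN intermediate of shape (seq_len, d_ffn) and one
    logits tensor of shape (rows, vocab_size); everything else is ignored.
    """
    return {
        # Logits for every position in the sequence.
        "logits_all": tensor_bytes(seq_len, vocab_size, bytes_per_elem),
        # Logits only for positions that are still masked.
        "logits_masked": tensor_bytes(num_masked, vocab_size, bytes_per_elem),
        # One FFN intermediate activation for the whole sequence.
        "ffn": tensor_bytes(seq_len, d_ffn, bytes_per_elem),
    }


def dominant(peaks: dict, logits_key: str) -> str:
    """Return which component is larger: 'logits' or 'ffn'."""
    return "logits" if peaks[logits_key] > peaks["ffn"] else "ffn"


# Hypothetical model dimensions, for illustration only.
few_masked = transient_peaks(seq_len=32768, num_masked=1024,
                             vocab_size=128000, d_ffn=12288)
many_masked = transient_peaks(seq_len=32768, num_masked=8192,
                              vocab_size=128000, d_ffn=12288)

print(dominant(few_masked, "logits_all"))      # logits
print(dominant(few_masked, "logits_masked"))   # ffn
print(dominant(many_masked, "logits_masked"))  # logits
```

Under these assumed dimensions, full-sequence logits (about 8.4 GB) dwarf the FFN intermediate (about 0.8 GB). Restricting logits to masked positions moves the peak to the FFN when few positions are masked and back to the logits when many are, which is one way to read why a static, local memory policy fits this workload poorly.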

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Topic Modeling