TL;DR
TIDE is a novel, lossless inference system for diffusion LLMs that optimizes expert offload and scheduling, significantly improving throughput on resource-constrained devices without requiring model retraining.
Contribution
It introduces an I/O-aware expert refresh strategy and mathematical scheduling optimization for efficient, lossless diffusion LLM inference without model training.
Findings
Achieves up to 1.4× and 1.5× throughput improvements on LLaDA2.0 models.
Leverages temporal stability of expert activations for resource-efficient inference.
Provides a lossless, no-training-required acceleration method.
Abstract
Diffusion Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive (AR) models, offering better hardware utilization and bidirectional context through parallel block-level decoding. However, as dLLMs continue to scale up with mixture-of-experts (MoE) architectures, their deployment on resource-constrained devices remains an open challenge. Existing AR-based methods often incur either prohibitive I/O overhead or significant compute bottlenecks. In this work, we propose TIDE, a novel resource-efficient inference system for MoE dLLMs. Specifically, we leverage the temporal stability of expert activations during the diffusion process within a block and introduce an interval-based expert refresh strategy that updates the expert placement in an I/O-aware fashion. To…
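To make the idea of an interval-based, I/O-aware refresh concrete, here is a toy sketch. It is not TIDE's algorithm: the function name `run_block`, the activation-count ranking, the eviction rule, and the `max_swaps` budget are all assumptions made for illustration. It only shows why stable activations across denoising steps let a placement that is refreshed every few steps serve most expert lookups from fast memory. Misses are merely counted here; in a real system they would still be computed on a slower path, so outputs would be unaffected.

```python
from collections import Counter


def run_block(step_activations, capacity, interval, max_swaps):
    """Toy interval-based expert refresh over one block's denoising steps.

    step_activations: list of sets of expert ids activated at each step
                      (a hypothetical trace, not real model data).
    capacity:  number of experts that fit in fast (e.g. GPU) memory.
    interval:  refresh the placement every `interval` steps.
    max_swaps: crude I/O budget, i.e. max experts loaded per refresh.
    Returns (hits, misses, loads).
    """
    resident = set()    # experts currently held in fast memory
    counts = Counter()  # activation counts since the last refresh
    hits = misses = loads = 0

    for step, active in enumerate(step_activations):
        if step > 0 and step % interval == 0:
            # Rank experts by how often they fired in the last interval.
            wanted = [e for e, _ in counts.most_common(capacity)]
            wanted_set = set(wanted)
            # Load only what is missing, capped by the I/O budget.
            incoming = [e for e in wanted if e not in resident][:max_swaps]
            for e in incoming:
                if len(resident) >= capacity:
                    # Evict a resident expert that is no longer wanted.
                    victims = [r for r in resident if r not in wanted_set]
                    victim = min(victims, key=lambda r: (counts[r], r))
                    resident.discard(victim)
                resident.add(e)
                loads += 1
            counts.clear()

        for e in sorted(active):
            if e in resident:
                hits += 1
            else:
                misses += 1  # would run on the slow path in a real system
            counts[e] += 1

    return hits, misses, loads


if __name__ == "__main__":
    # Experts 0 and 1 are stable across steps; expert 2 appears once.
    trace = [{0, 1}, {0, 1}, {0, 1}, {0, 2}]
    print(run_block(trace, capacity=2, interval=2, max_swaps=2))
```

In this sketch, a larger `interval` or a smaller `max_swaps` reduces the number of loads at the cost of more misses. The abstract describes this kind of I/O-versus-compute trade-off but is truncated before giving details.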
