NPU Design for Diffusion Language Model Inference
Binglei Lou, Haoran Wu, Kevin Lau, Gregor MacDonald, Jiayi Nie, Yao Lai, Can Xiao, Xuan Guo, Jianyi Cheng, Rika Antonova, Robert Mullins, Aaron Zhao

TL;DR
This paper presents the first NPU specifically designed for diffusion-based language models, introducing a new ISA, hardware, and quantization techniques tailored to their inference patterns.
Contribution
It develops a dedicated NPU architecture with a specialized ISA, execution model, and quantization scheme for diffusion LLMs, validated through comprehensive simulation and an RTL implementation.
Findings
Achieves hardware support for bidirectional attention and block-wise KV cache in dLLMs.
Introduces Block-Adaptive Online Smoothing (BAOS) for effective KV cache quantization.
Provides a complete RTL implementation and simulation framework for the proposed NPU.
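The listing does not describe how BAOS works internally. As a rough illustration of the general idea behind smoothing-based KV quantization, the sketch below recomputes a per-channel smoothing factor for each block before quantizing to int8. The function names, the list-of-rows layout, and the choice of absolute maximum as the smoothing statistic are all assumptions made for this example; this is not the paper's algorithm.

```python
def quantize_block(block, bits=8):
    """Quantize one KV block (rows = tokens, columns = channels).

    Hypothetical helper: smoothing factors are derived from this block only,
    so they adapt whenever the block's value distribution shifts.
    """
    qmax = 2 ** (bits - 1) - 1
    n_ch = len(block[0])
    # Per-channel smoothing factor: the channel's absolute maximum in this block.
    # A channel that is all zeros falls back to 1.0 to avoid division by zero.
    smooth = [max(abs(row[c]) for row in block) or 1.0 for c in range(n_ch)]
    smoothed = [[row[c] / smooth[c] for c in range(n_ch)] for row in block]
    # After smoothing, every channel lies in [-1, 1], so one scale fits the block.
    scale = 1.0 / qmax
    q = [[round(v / scale) for v in row] for row in smoothed]
    return q, smooth, scale


def dequantize_block(q, smooth, scale):
    """Undo the quantization and the per-channel smoothing."""
    return [[v * scale * smooth[c] for c, v in enumerate(row)] for row in q]


block = [[0.1, 100.0], [-0.2, 50.0]]  # channel 1 is a large outlier channel
q, smooth, scale = quantize_block(block)
restored = dequantize_block(q, smooth, scale)
```

Without the smoothing step, a single int8 scale sized for the outlier channel (about 100/127) would round the small channel's values to zero. With it, both channels keep most of their resolution.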
Abstract
Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-k-driven sampling stage, demand new ISA and memory hierarchy support beyond that of AR accelerators. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes were designed for static activation distributions and do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a…
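To make the reduction-heavy sampling stage concrete, here is a minimal sketch of one confidence-based unmasking step of the kind commonly used in masked diffusion language models. It assumes that masked positions are marked with `None`, that confidence is the maximum softmax probability, and that the `k` most confident positions are committed per step. The names `MASK`, `softmax`, and `unmask_top_k` are hypothetical, and the paper's actual sampling procedure may differ.

```python
import math

MASK = None  # hypothetical marker for a position that is still masked


def softmax(row):
    """Numerically stable softmax over one vocabulary-sized logit row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def unmask_top_k(tokens, logits, k):
    """Commit the k most confident masked positions; leave the rest masked."""
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok is MASK:
            probs = softmax(logits[pos])
            conf = max(probs)  # reduction over the vocabulary
            candidates.append((conf, pos, probs.index(conf)))
    # Top-k selection over per-position confidences.
    candidates.sort(key=lambda c: -c[0])
    out = list(tokens)
    for _conf, pos, tok in candidates[:k]:
        out[pos] = tok
    return out


tokens = [5, MASK, MASK, MASK]
logits = [[0, 0, 0], [0, 0, 0], [10, 0, 0], [0, 5, 0]]
result = unmask_top_k(tokens, logits, k=2)
```

Every operation in this step is a row-wise reduction (max, sum, argmax) or a selection rather than a matrix multiply, which is the sense in which the sampling phase is non-GEMM-centric.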
