NPU Design for Diffusion Language Model Inference

Binglei Lou; Haoran Wu; Kevin Lau; Gregor MacDonald; Jiayi Nie; Yao Lai; Can Xiao; Xuan Guo; Jianyi Cheng; Rika Antonova; Robert Mullins; Aaron Zhao

arXiv:2601.20706·cs.AR·April 24, 2026


TL;DR

This paper presents the first NPU designed specifically for diffusion-based language models, introducing an ISA, hardware architecture, and quantization scheme tailored to their inference patterns.

Contribution

The paper develops a dedicated NPU architecture with a specialized ISA, execution model, and quantization scheme for diffusion LLMs, validated through simulation and an RTL implementation.

Findings

01

Provides hardware support for bidirectional attention and a block-wise KV cache in dLLMs.

02

Introduces Block-Adaptive Online Smoothing (BAOS) for effective KV cache quantization.

03

Provides a complete RTL implementation and simulation framework for the proposed NPU.

Abstract

Diffusion-based LLMs (dLLMs) fundamentally depart from traditional autoregressive (AR) LLM inference: they leverage bidirectional attention, block-wise KV cache refreshing, cross-step reuse, and a non-GEMM-centric sampling phase. These characteristics make current dLLMs incompatible with most existing NPUs, as their inference patterns, in particular the reduction-heavy, top-$k$-driven sampling stage, demand new ISA and memory hierarchy support beyond that of AR accelerators. In addition, the blocked diffusion KV cache breaks from the append-only paradigm assumed by AR NPUs, and conventional AR-derived KV quantization schemes were designed for static activation distributions and do not account for the step-wise distribution shifts introduced by iterative block-wise refinement in dLLMs. In this paper, we introduce the first NPU accelerator specifically designed for dLLMs. It delivers: a…
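
To make the "reduction-heavy, top-$k$-driven sampling stage" concrete, the sketch below shows one common form of confidence-based unmasking used by masked diffusion language models: every still-masked position in a block is scored by its most likely token, and only the k most confident positions are committed in that step. This is an illustration under stated assumptions, not the paper's sampling procedure, which this excerpt does not spell out. The names MASK, softmax_max, and unmask_top_k are hypothetical and chosen for this example only.

```python
import heapq
import math

MASK = -1  # hypothetical sentinel for a position that is still masked


def softmax_max(logits):
    """Return (argmax token id, its softmax probability) for one position."""
    m = max(logits)                           # reduction 1: max over the vocabulary
    exps = [math.exp(x - m) for x in logits]  # shifted for numerical stability
    z = sum(exps)                             # reduction 2: sum over the vocabulary
    best = max(range(len(logits)), key=logits.__getitem__)
    return best, exps[best] / z


def unmask_top_k(tokens, logits_per_pos, k):
    """Commit the k most confident masked positions; leave the rest masked."""
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            tok_id, conf = softmax_max(logits_per_pos[pos])
            candidates.append((conf, pos, tok_id))
    # reduction 3: top-k selection across positions in the block
    chosen = heapq.nlargest(k, candidates)
    out = list(tokens)
    for _conf, pos, tok_id in chosen:
        out[pos] = tok_id
    return out


# Toy block of 4 positions over a 3-token vocabulary (made-up logits).
tokens = [5, MASK, MASK, MASK]
logits = [
    [0.0, 0.0, 0.0],  # position 0 is already committed, so its logits are ignored
    [0.0, 4.0, 0.0],  # confident in token 1
    [1.0, 1.1, 1.0],  # nearly uniform, low confidence
    [3.0, 0.0, 0.0],  # confident in token 0
]
result = unmask_top_k(tokens, logits, k=2)
```

In this toy run, positions 1 and 3 are committed and position 2 stays masked for a later step. Note that none of the three reductions is a matrix multiply, which matches the abstract's description of the sampling phase as non-GEMM-centric.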
