W4A16 Mixed-Precision Matrix Multiplication on Decoupled Architecture: Kernel Design and Memory Bottleneck Analysis for Ascend NPUs

Yuanhong He; Peiyu Niu; Jun Chen; Chenchen Zhang; Chao Yang

arXiv:2601.16536 · cs.DC · March 4, 2026

TL;DR

This paper introduces a practical W4A16 matrix multiplication kernel for Huawei's Ascend 910 NPU, optimizing mixed-precision LLM deployment by addressing architecture-specific challenges and memory bottlenecks.

Contribution

It presents the first tailored W4A16 kernel for Ascend 910, leveraging vector and cube cores, and analyzes performance bottlenecks for efficient quantized LLM deployment.

Findings

1. Achieves 1.01x to 1.74x speedup over data-parallel methods.

2. Memory transfer, not dequantization, is the main bottleneck.

3. Maximum speedup of 1.48x over native FP16 in PyTorch.

Abstract

As Large Language Models (LLMs) scale, weight-only quantization (W4A16: 4-bit weights, 16-bit activations) becomes critical for reducing memory footprint with minimal accuracy loss. However, its efficient deployment on Huawei's Ascend 910 Neural Processing Unit (NPU) is challenging due to limited native mixed-precision support and the accelerator's decoupled compute architecture. To enable quantization on such an architecture, we present the first practical W4A16 matrix multiplication kernel tailored for the Ascend 910 NPU. Our design leverages vector cores for on-the-fly INT4-to-FP16 dequantization, cube cores for high-throughput GEMM, and Split-K parallelization to mitigate memory latency. Performance evaluations across diverse matrix shapes and batch sizes show that our method outperforms data-parallel approaches when K >> N, a typical scenario in LLM decoding. Specifically, our method can…
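The data flow described in the abstract (unpack INT4 weights, dequantize to FP16, multiply, and combine Split-K partial results) can be illustrated with a small NumPy emulation. This is a sketch under stated assumptions, not the paper's Ascend kernel: the function names, the packing layout (two 4-bit values per byte along K, low nibble first), symmetric per-group scales without a zero point, the group size of 32, and FP32 accumulation are all hypothetical choices made for illustration.

```python
import numpy as np

def pack_int4(w_q):
    """Pack signed 4-bit values in [-8, 7], shape (K, N) with even K,
    two per byte along K (low nibble first). Layout is an assumption."""
    u = (w_q.astype(np.int16) & 0xF).astype(np.uint8)
    return u[0::2] | (u[1::2] << 4)            # shape (K // 2, N)

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int8 values in [-8, 7]."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo >= 8, lo - 16, lo)        # sign-extend the nibble
    hi = np.where(hi >= 8, hi - 16, hi)
    out = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

def quantize_w4(w, group=32):
    """Symmetric per-group quantization along K (hypothetical scheme)."""
    K, N = w.shape
    wg = w.reshape(K // group, group, N)
    scale = np.maximum(np.abs(wg).max(axis=1, keepdims=True) / 7.0, 1e-4)
    q = np.clip(np.rint(wg / scale), -8, 7).astype(np.int8).reshape(K, N)
    return pack_int4(q), scale.reshape(K // group, N).astype(np.float16)

def w4a16_matmul_splitk(a, packed, scale, group=32, splits=4):
    """Emulate W4A16 GEMM with Split-K: each K-slice is dequantized and
    multiplied independently, then the partial products are summed."""
    M, K = a.shape
    N = packed.shape[1]
    chunk = K // splits
    assert K % splits == 0 and chunk % group == 0
    partials = np.zeros((splits, M, N), dtype=np.float32)
    for s in range(splits):
        k0, k1 = s * chunk, (s + 1) * chunk
        # Dequantization step (the role the abstract assigns to vector cores).
        q = unpack_int4(packed[k0 // 2:k1 // 2])
        sc = np.repeat(scale[k0 // group:k1 // group], group, axis=0)
        w_fp16 = q.astype(np.float16) * sc
        # GEMM step (the role of cube cores); FP32 accumulation is assumed.
        partials[s] = a[:, k0:k1].astype(np.float32) @ w_fp16.astype(np.float32)
    # Reduction over the K-splits.
    return partials.sum(axis=0).astype(np.float16)
```

For example, with FP16 activations `a` of shape (2, 256) and weights of shape (256, 64), calling `quantize_w4` and then `w4a16_matmul_splitk(a, packed, scale, splits=4)` returns a (2, 64) FP16 result. In this emulation the splits run sequentially; on real hardware the point of Split-K is that slices of a long K dimension can be processed in parallel, which matters most when K is much larger than N.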

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications