Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

Stef Cuyckens; Xiaoling Yi; Robin Geens; Joren Dumoulin; Martin Wiesner; Chao Fang; Marian Verhelst

arXiv:2511.06313 · cs.AR · March 13, 2026


TL;DR

This paper introduces a hybrid precision-scalable reduction tree for MX MACs, enabling efficient mixed-precision accumulation in NPUs, and demonstrates significant energy efficiency improvements in an integrated system.

Contribution

The paper proposes a hybrid precision-scalable reduction tree for MX MACs and integrates an 8x8 array of these MACs into the state-of-the-art NPU integration platform SNAX, targeting higher energy efficiency and throughput for next-generation neural processing.

Findings

01 Achieves up to 4065 GOPS/W energy efficiency.

02 Supports multiple MX precisions with high throughput.

03 Demonstrates improved system-level performance over state-of-the-art.

Abstract

Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8x8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient…
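
To make the integer-accumulation side of the trade-off described in the abstract concrete, the following is a minimal sketch of a block-scaled dot product: each block shares one power-of-two scale, the narrow elements are multiplied and accumulated exactly as integers, and the two shared exponents are applied once at the end. It is not the paper's hybrid reduction tree. The function names `quantize_block` and `block_dot`, the block size of 32, and the simplified signed-integer element encoding are assumptions made for illustration only.

```python
import math

BLOCK = 32  # assumed block size for illustration; not stated in the text above


def quantize_block(values, elem_bits=8):
    """Return (shared_exp, ints) such that values[i] ~= ints[i] * 2**shared_exp.

    Hypothetical helper: a simplified signed-integer element encoding with one
    shared power-of-two scale per block, not an exact MX element format.
    """
    qmax = 2 ** (elem_bits - 1) - 1
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps the largest element in range.
    shared_exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** shared_exp
    # Clamp guards against floating-point edge cases in the division.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, ints


def block_dot(a, b):
    """Dot product of two blocks using integer accumulation.

    Element products are summed exactly as Python integers; the two shared
    exponents are combined and applied a single time after the reduction.
    """
    ea, qa = quantize_block(a)
    eb, qb = quantize_block(b)
    acc = sum(x * y for x, y in zip(qa, qb))
    return acc * 2.0 ** (ea + eb)


if __name__ == "__main__":
    a = [i / 32 for i in range(BLOCK)]
    b = [1.0] * BLOCK
    print(block_dot(a, b), sum(x * y for x, y in zip(a, b)))
```

With integer elements the reduction itself is exact, and rounding error enters only at quantization. When the elements are narrow floating-point values instead, each product first has to be converted to a common fixed-point alignment before it can join such an integer sum, which is the conversion cost the abstract refers to.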
