Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs
Endri Taka, Andre Roesti, Joseph Melber, Pranathi Vasireddy, Kristof Denolf, Diana Marculescu

TL;DR
This paper introduces a systematic methodology for optimizing GEMM workloads on AMD's Ryzen AI NPUs, achieving state-of-the-art throughput across two hardware generations (XDNA and XDNA2) and two precisions (int8 and bf16), with detailed performance insights.
Contribution
The paper presents a common optimization methodology for AMD's Ryzen AI NPUs that exploits their architectural features and addresses system-level bottlenecks to improve GEMM performance on both the XDNA and XDNA2 generations.
Findings
Achieved up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for int8 precision.
Attained up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2) for bf16 precision.
Provided detailed performance analysis and insights for GEMM workloads on Ryzen AI NPUs.
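To make the TOPS figures above concrete, the sketch below uses the conventional operation count for a GEMM, 2·M·N·K (one multiply and one add per inner-product term), and divides it by the elapsed time. This is a minimal illustration, not the paper's measurement code: the function names, the 4096 x 4096 x 4096 problem size, and the derived timing are assumptions chosen for the example.

```python
def gemm_tops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in tera-operations per second for an (M x K) by (K x N) GEMM."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    ops = 2 * m * n * k  # one multiply and one add per inner-product term
    return ops / seconds / 1e12


def seconds_at(m: int, n: int, k: int, tops: float) -> float:
    """Time one GEMM would take if it ran at the given TOPS rate."""
    if tops <= 0:
        raise ValueError("tops must be positive")
    return 2 * m * n * k / (tops * 1e12)


if __name__ == "__main__":
    # Hypothetical problem size; 38.05 TOPS is the reported int8 peak on XDNA2.
    m = n = k = 4096
    t = seconds_at(m, n, k, 38.05)
    print(f"{m}x{n}x{k} GEMM at 38.05 TOPS: {t * 1e3:.2f} ms")
    print(f"round trip: {gemm_tops(m, n, k, t):.2f} TOPS")
```

The same helper applies to the bf16 results, since the operation count does not depend on precision; only the measured time changes.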
Abstract
The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Ferroelectric and Negative Capacitance Devices
