Striking the Balance: GEMM Performance Optimization Across Generations of Ryzen AI NPUs

Endri Taka; Andre Roesti; Joseph Melber; Pranathi Vasireddy; Kristof Denolf; Diana Marculescu

arXiv:2512.13282 · cs.AR · December 16, 2025

Open Access

TL;DR

This paper introduces a systematic methodology for optimizing GEMM workloads on AMD's Ryzen AI NPUs. It reports state-of-the-art throughput on both NPU generations (XDNA and XDNA2) for int8 and bf16 precision, along with a detailed performance analysis.

Contribution

The paper presents a unified optimization approach tailored to AMD's Ryzen AI NPUs that exploits their architectural features to improve GEMM performance on both current hardware generations, XDNA and XDNA2.

Findings

01. Achieved up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for int8 precision.

02. Attained up to 3.14 TOPS (XDNA) and 14.71 TOPS (XDNA2) for bf16 precision.

03. Provided detailed performance analysis and insights for GEMM workloads on Ryzen AI NPUs.
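To put the throughput figures above in context, the sketch below shows how a TOPS number is commonly derived from a GEMM's dimensions and its measured runtime. It assumes the usual convention of counting 2·M·K·N operations (one multiply and one add per inner-product term); this page does not state which convention the paper uses. The function names and the 1 ms timing are hypothetical, and only the four peak values are taken from the findings.

```python
def gemm_ops(m: int, k: int, n: int) -> int:
    """Operation count for C[m, n] = A[m, k] @ B[k, n].

    Assumes one multiply and one add per inner-product term (2*M*K*N),
    a common convention that the source page does not confirm.
    """
    return 2 * m * k * n


def tops(m: int, k: int, n: int, seconds: float) -> float:
    """Throughput in tera-operations per second for one GEMM run."""
    return gemm_ops(m, k, n) / seconds / 1e12


# Hypothetical timing, for illustration only (not a measurement from the paper).
example_tops = tops(1024, 1024, 1024, 1e-3)

# Peak values reported in the findings, in TOPS.
reported = {
    "int8": {"XDNA": 6.76, "XDNA2": 38.05},
    "bf16": {"XDNA": 3.14, "XDNA2": 14.71},
}

# Ratio of XDNA2 to XDNA peak throughput for each precision.
generation_ratio = {
    precision: peaks["XDNA2"] / peaks["XDNA"]
    for precision, peaks in reported.items()
}

if __name__ == "__main__":
    print(f"hypothetical 1024^3 GEMM in 1 ms: {example_tops:.3f} TOPS")
    for precision, ratio in generation_ratio.items():
        print(f"{precision}: XDNA2 / XDNA peak ratio = {ratio:.2f}")
```

Under these assumptions, the reported peaks correspond to a cross-generation ratio of roughly 5.6x for int8 and 4.7x for bf16. The peaks may come from different GEMM sizes, so the ratios are indicative only.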

Abstract

The high computational and memory demands of modern deep learning (DL) workloads have led to the development of specialized hardware devices from cloud to edge, such as AMD's Ryzen AI XDNA NPUs. Optimizing general matrix multiplication (GEMM) algorithms for these architectures is critical for improving DL workload performance. To this end, this paper presents a common systematic methodology to optimize GEMM workloads across the two current NPU generations, namely XDNA and XDNA2. Our implementations exploit the unique architectural features of AMD's NPUs and address key performance bottlenecks at the system level. End-to-end performance evaluation across various GEMM sizes demonstrates state-of-the-art throughput of up to 6.76 TOPS (XDNA) and 38.05 TOPS (XDNA2) for 8-bit integer (int8) precision. Similarly, for brain floating-point (bf16) precision, our GEMM implementations attain up to…
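The truncated abstract does not describe the kernels themselves, so the following is only a generic illustration of loop tiling, the standard starting point for GEMM optimization on accelerators with small local memories. It is not the paper's implementation: the function name and the tile sizes `tile_m`, `tile_k`, and `tile_n` are hypothetical, and the code is a plain-Python reference rather than NPU code.

```python
def tiled_gemm(a, b, tile_m=2, tile_k=2, tile_n=2):
    """Reference blocked GEMM: returns C = A @ B for lists of lists.

    The three outer loops walk over tiles; the three inner loops compute
    one tile-sized partial product and accumulate it into C. On real
    hardware each tile would be sized to fit local memory.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k, "inner dimensions must match"
    c = [[0] * n for _ in range(m)]

    for i0 in range(0, m, tile_m):          # tile row of C
        for j0 in range(0, n, tile_n):      # tile column of C
            for k0 in range(0, k, tile_k):  # reduction tile
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


def naive_gemm(a, b):
    """Untiled reference used to check the blocked version."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[1, 0], [0, 1], [2, 2]]
    print(tiled_gemm(a, b))
    print(tiled_gemm(a, b) == naive_gemm(a, b))
```

The tiled and untiled versions produce the same result. Tiling changes only the order of memory accesses, which is why the choice of tile sizes is usually a central tuning parameter in GEMM work of this kind.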

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Ferroelectric and Negative Capacitance Devices