Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach

Qiao Zhang; Rabab Alomairy; Dali Wang; Zhuowei Gu; Qinglei Cao

arXiv:2508.14848·cs.DC·August 21, 2025

Open Access

TL;DR

This paper presents a hardware-aware, tile-centric mixed-precision GEMM framework that adapts precision at fine granularity, significantly improving performance and energy efficiency across diverse high-performance computing architectures.

Contribution

It introduces an adaptive mixed-precision GEMM approach supported by the PaRSEC runtime, enabling efficient workload balancing on multiple architectures.

Findings

01

Scales well on ARM, Nvidia, and AMD architectures.

02

Improves performance and energy efficiency.

03

Supports fine-grained mixed-precision computation.

Abstract

General Matrix Multiplication (GEMM) is a critical operation underpinning a wide range of applications in high-performance computing (HPC) and artificial intelligence (AI). The emergence of hardware optimized for low-precision arithmetic necessitates a reevaluation of numerical algorithms so that they can leverage mixed-precision computation for improved performance and energy efficiency. This research introduces an adaptive mixed-precision GEMM framework that supports different precision formats at the fine-grained tile/block level. We utilize the PaRSEC runtime system to balance workloads across various architectures. The performance scales well on the ARM CPU-based Fugaku supercomputer, the Nvidia GPU-based A100 DGX, and the AMD GPU-based Frontier supercomputer. This research aims to enhance computational efficiency and accuracy by bridging algorithmic advancements and hardware innovations, driving…
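To make the tile-level idea concrete, here is a minimal sketch of a blocked GEMM in which each input tile is rounded to its own storage format before being multiplied. It is not the authors' implementation and does not use PaRSEC. The function names, the `diagonal_policy` rule (double precision on diagonal tiles, half precision elsewhere), and the choice to accumulate in double precision are all assumptions made for illustration. Narrower formats are emulated with the standard library's `struct` module.

```python
import struct

def round_to(x, fmt):
    """Round a Python float to an IEEE format: 'e' = fp16, 'f' = fp32, 'd' = fp64."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def mixed_precision_gemm(A, B, n, tile, fmt_of_tile):
    """Compute C = A * B for n x n lists of lists, one tile triple at a time.

    fmt_of_tile(ti, tj) is a hypothetical policy returning the storage
    format of tile (ti, tj); it is applied to the tiles of both A and B.
    Products are accumulated in double precision (an assumption).
    """
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, tile):
        for tj in range(0, n, tile):
            for tk in range(0, n, tile):
                fa = fmt_of_tile(ti // tile, tk // tile)  # format of the A tile
                fb = fmt_of_tile(tk // tile, tj // tile)  # format of the B tile
                for i in range(ti, min(ti + tile, n)):
                    for j in range(tj, min(tj + tile, n)):
                        acc = 0.0
                        for k in range(tk, min(tk + tile, n)):
                            acc += round_to(A[i][k], fa) * round_to(B[k][j], fb)
                        C[i][j] += acc
    return C

def diagonal_policy(ti, tj):
    """Illustrative policy: fp64 on diagonal tiles, fp16 everywhere else."""
    return "d" if ti == tj else "e"

if __name__ == "__main__":
    n, tile = 4, 2
    A = [[0.1 * (i + 1) + 0.01 * j for j in range(n)] for i in range(n)]
    B = [[0.05 * (j + 1) - 0.01 * i for j in range(n)] for i in range(n)]
    exact = mixed_precision_gemm(A, B, n, tile, lambda ti, tj: "d")
    mixed = mixed_precision_gemm(A, B, n, tile, diagonal_policy)
    err = max(abs(exact[i][j] - mixed[i][j]) for i in range(n) for j in range(n))
    print("max abs error from fp16 off-diagonal tiles:", err)
```

A real framework would choose the per-tile format from the data and the target hardware and would dispatch each tile product to a precision-specific kernel; the sketch only shows where a per-tile precision decision enters the blocked loop.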

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Cellular Automata and Applications