High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines

Yuki Uchino; Katsuhisa Ozaki; Toshiyuki Imamura

arXiv:2508.03984·cs.DC·November 13, 2025

TL;DR

This paper introduces emulation methods for single- and double-precision matrix multiplication (SGEMM and DGEMM) on INT8 matrix engines. On a GH200 Grace Hopper Superchip, they run faster and more power-efficiently than native SGEMM and DGEMM for sufficiently large problems, and they outperform conventional emulation approaches.

Contribution

The study presents new techniques for emulating SGEMM and DGEMM on low-precision INT8 matrix engines that outperform existing emulation methods in speed and power efficiency.

Findings

01

DGEMM emulation: 1.4x speedup and 43% better power efficiency than native DGEMM (GH200, sufficiently large problems)

02

SGEMM emulation: 3.0x speedup and 154% better power efficiency than native SGEMM (GH200, sufficiently large problems)

03

Over 2x performance improvement over conventional emulation methods

Abstract

Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation…
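
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea behind slice-based emulation of a floating-point product with INT8 operands. It is not the method proposed in the paper. It assumes a fixed 6-bit digit width, uses NumPy `int32` products as a stand-in for an INT8 matrix engine with INT32 accumulation, and the helper names `to_int8_slices` and `emulated_matmul` are hypothetical.

```python
import numpy as np

BITS = 6  # assumed digit width; every digit satisfies |d| < 2**6, so it fits in int8


def to_int8_slices(X, num_slices, axis):
    """Scale X by powers of two along `axis`, then peel off signed int8 digits."""
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so |X / 2**e| < 1 (e = 0 for x = 0).
    _, exp = np.frexp(np.max(np.abs(X), axis=axis, keepdims=True))
    scale = np.ldexp(1.0, exp)
    R = X / scale  # power-of-two scaling (exact apart from underflow)
    slices = []
    for _ in range(num_slices):
        R = R * 2.0**BITS
        D = np.trunc(R)           # next digit, an integer with |D| < 2**BITS
        slices.append(D.astype(np.int8))
        R = R - D                 # remainder, again with |R| < 1
    return slices, scale


def emulated_matmul(A, B, num_slices=8):
    """Approximate A @ B from products of int8 slices (illustrative only)."""
    A_sl, sa = to_int8_slices(A, num_slices, axis=1)  # one scale per row of A
    B_sl, sb = to_int8_slices(B, num_slices, axis=0)  # one scale per column of B
    C = np.zeros((A.shape[0], B.shape[1]))
    for t, Da in enumerate(A_sl, start=1):
        for u, Db in enumerate(B_sl, start=1):
            # Stand-in for an INT8 matrix engine with INT32 accumulation.
            # The int32 sum cannot overflow while k * 63 * 63 < 2**31.
            P = Da.astype(np.int32) @ Db.astype(np.int32)
            # Slice t of A carries weight 2**(-BITS*t); likewise slice u of B.
            C += np.ldexp(P.astype(np.float64), -BITS * (t + u))
    return C * sa * sb  # undo the row and column scaling


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
err = np.max(np.abs(emulated_matmul(A, B) - A @ B))
print(err)  # maximum absolute difference from the float64 product
```

With `num_slices` slices per operand, this sketch performs `num_slices**2` integer products, which suggests why the cost of emulation depends heavily on how many low-precision multiplications a scheme needs.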
