High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura

TL;DR
This paper introduces emulation methods for single- and double-precision matrix multiplication (SGEMM and DGEMM) on INT8 matrix engines. On a GH200 Grace Hopper Superchip, they are faster and more power-efficient than native SGEMM and DGEMM, and they outperform conventional emulation methods.
Contribution
The study presents new techniques that emulate single- and double-precision matrix multiplication on low-precision INT8 matrix engines, and that outperform existing emulation methods in speed and power efficiency on modern hardware.
Findings
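DGEMM emulation: 1.4x speedup and 43% power efficiency gain over native DGEMM (GH200, sufficiently large problems)
SGEMM emulation: 3.0x speedup and 154% power efficiency gain over native SGEMM (GH200, sufficiently large problems)
Over 2x performance improvement over conventional emulation methods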
Abstract
Recent architectures integrate high-performance and power-efficient matrix engines. These engines demonstrate remarkable performance in low-precision matrix multiplication, which is crucial in deep learning. Several techniques have been proposed to emulate single- and double-precision general matrix-matrix multiplication (SGEMM and DGEMM, respectively) by leveraging such low-precision matrix engines. In this study, we present emulation methods that significantly outperform conventional approaches. On a GH200 Grace Hopper Superchip, the proposed DGEMM emulation achieves a 1.4x speedup and a 43% improvement in power efficiency compared to native DGEMM for sufficiently large problems. The proposed SGEMM emulation achieves a 3.0x speedup and a 154% improvement in power efficiency compared to native SGEMM for sufficiently large problems. Furthermore, compared to conventional emulation…
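The abstract does not spell out the algorithm, so the following is only a generic illustration of how a higher-precision product can be assembled from INT8-sized pieces, and should not be read as the authors' method. It is a minimal sketch under stated assumptions: inputs are scaled by one shared power of two, rounded to integers, split into signed 7-bit slices, multiplied slice by slice with exact integer accumulation, and recombined. The names to_int8_slices, int_matmul, and emulated_matmul are hypothetical, and plain Python integers stand in for a hardware INT8 engine with INT32 accumulation.

```python
import math

BITS = 7  # each slice is a balanced base-2**7 digit in [-64, 63], so it fits in int8


def to_int8_slices(x, num_slices):
    """Split integer x into num_slices balanced base-128 digits (least significant first)."""
    base, half = 1 << BITS, 1 << (BITS - 1)
    digits = []
    for _ in range(num_slices):
        d = (x + half) % base - half  # digit in [-64, 63]
        digits.append(d)
        x = (x - d) // base
    if x != 0:
        raise ValueError("value does not fit in the requested number of slices")
    return digits


def int_matmul(A, B):
    """Exact integer product; a stand-in for an INT8 engine with INT32 accumulation."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def emulated_matmul(A, B, num_slices=3):
    """Approximate A @ B for float matrices using only int8-sized operands."""
    max_abs = max(abs(v) for M in (A, B) for row in M for v in row)
    # Shared power-of-two scale (an assumption of this sketch):
    # |v| * 2**e <= 2**(BITS*num_slices - 2), which always fits the slices.
    e = BITS * num_slices - 2 - math.frexp(max_abs)[1]

    def slices_of(M):
        q = [[to_int8_slices(round(math.ldexp(v, e)), num_slices) for v in row]
             for row in M]
        # Reorder into one int8-valued matrix per slice index.
        return [[[cell[t] for cell in row] for row in q] for t in range(num_slices)]

    SA, SB = slices_of(A), slices_of(B)
    rows, cols = len(A), len(B[0])
    acc = [[0] * cols for _ in range(rows)]
    for s in range(num_slices):
        for t in range(num_slices):
            P = int_matmul(SA[s], SB[t])   # one low-precision GEMM per slice pair
            w = 1 << (BITS * (s + t))      # positional weight of this slice pair
            for i in range(rows):
                for j in range(cols):
                    acc[i][j] += w * P[i][j]
    # Undo the scaling applied to both operands.
    return [[math.ldexp(float(v), -2 * e) for v in row] for row in acc]


A = [[1.5, -2.25], [0.125, 3.0]]
B = [[2.0, 0.5], [-1.0, 4.0]]
C = emulated_matmul(A, B)
```

In this sketch the integer arithmetic is exact, so the only error comes from rounding the scaled inputs; increasing num_slices keeps more bits of each input at the cost of num_slices squared low-precision products. The example inputs above scale to exact integers, so C matches the ordinary floating-point product.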
