DGEMM on Integer Matrix Multiplication Unit

Hiroyuki Ootomo; Katsuhisa Ozaki; Rio Yokota

arXiv:2306.11975 · cs.DC · April 2, 2024 · Int. J. High Perform. Comput. Appl. · 1 citation

TL;DR

This paper explores using the integer matrix multiplication units (IMMUs) found in deep learning hardware to accelerate high-precision matrix computations, demonstrating speedups for double-precision matrix multiplication and for quantum circuit simulation while preserving FP64 accuracy.

Contribution

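It introduces a method for using IMMUs in high-precision matrix multiplication via the Ozaki scheme, and compares it with cuBLAS and with an existing Ozaki scheme implementation on FP16 Tensor Cores, identifying both the advantages and the disadvantages of IMMUs.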

Findings

01. Integer Tensor Cores outperform cuBLAS in double-precision matrix multiplication.

02. Quantum circuit simulation is accelerated by up to 4.33 times while maintaining FP64 accuracy.

03. The Ozaki scheme effectively uses low-precision units to produce high-precision results.
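One reason integer units suit the Ozaki scheme is that integer products carry no rounding error as long as the accumulator cannot overflow. The small Python check below illustrates that bound, assuming int8 inputs restricted to |d| ≤ 127 and a signed int32 accumulator (the int8-in, int32-accumulate configuration of NVIDIA's integer Tensor Cores). The helper names `max_safe_k` and `int32_dot` are hypothetical, and the emulated accumulator raises an error on overflow purely for illustration; it makes no claim about how real hardware behaves on overflow.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def max_safe_k(digit_bound: int = 127) -> int:
    """Largest inner dimension k for which any dot product of k pairs with
    |a|, |b| <= digit_bound is guaranteed to stay inside the signed int32 range."""
    return INT32_MAX // (digit_bound * digit_bound)


def int32_dot(a, b):
    """Dot product with an emulated int32 accumulator that reports wrap-around."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
        # Hypothetical check: flag any partial sum that would not fit in int32
        if not INT32_MIN <= acc <= INT32_MAX:
            raise OverflowError("int32 accumulator overflow")
    return acc


if __name__ == "__main__":
    k = max_safe_k()
    worst = int32_dot([127] * k, [127] * k)  # largest possible sum, still exact
    print(k, worst)
    try:
        int32_dot([127] * (k + 1), [127] * (k + 1))
    except OverflowError as exc:
        print(f"k = {k + 1}: {exc}")
```

Within this limit every integer dot product is exact, so any error in the final FP64 result comes from how the inputs are split and how the partial products are recombined, not from the integer unit itself.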

Abstract

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point value computation is commonplace, where the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMU). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication by using lower-precision computing units, and show the advantages and disadvantages of using IMMU. The experiment using integer Tensor Cores shows that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16…
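The abstract only names the Ozaki scheme. As a rough illustration of the idea, the pure-Python sketch below splits each FP64 matrix into integer slices of 7 significant bits (so every slice fits in int8), using one power-of-two exponent per row of A and per column of B. It then multiplies every pair of slices with an exact integer GEMM standing in for the IMMU and recombines the scaled partial products in floating point. This is not the ozIMMU code: the slice width, the choice to compute every slice pair, and all function names (`split_rows`, `int_matmul_nt`, `ozaki_int8_gemm`, `exact_matmul`) are assumptions made for illustration.

```python
import math
import random
from fractions import Fraction

BITS = 7  # assumed bits per slice: every digit satisfies |d| <= 127 (int8 range)


def transpose(M):
    return [list(col) for col in zip(*M)]


def split_rows(M, num_slices):
    """Split each row of M into integer slices sharing one power-of-two exponent.

    Afterwards M[i][j] ~= 2**exps[i] * sum_t slices[t][i][j] * 2**(-BITS * (t + 1)).
    """
    rows, cols = len(M), len(M[0])
    slices = [[[0] * cols for _ in range(rows)] for _ in range(num_slices)]
    exps = []
    for i, row in enumerate(M):
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1] if amax > 0 else 0  # guarantees |x| < 2**e
        exps.append(e)
        for j, x in enumerate(row):
            r = math.ldexp(x, -e)  # exact power-of-two scaling, |r| < 1
            for t in range(num_slices):
                r *= 2 ** BITS      # exact
                d = int(r)          # truncate toward zero, so |d| <= 127
                r -= d              # exact: keeps only the fractional part
                slices[t][i][j] = d
    return slices, exps


def int_matmul_nt(X, Y):
    """Exact integer product X @ Y^T; stands in for one int8 x int8 -> int32 IMMU call."""
    return [[sum(a * b for a, b in zip(xr, yr)) for yr in Y] for xr in X]


def ozaki_int8_gemm(A, B, num_slices=8):
    """Approximate A @ B using only exact integer products of the slices."""
    Da, ea = split_rows(A, num_slices)             # one exponent per row of A
    Db, eb = split_rows(transpose(B), num_slices)  # one exponent per column of B
    m, n = len(A), len(B[0])
    terms = [[[] for _ in range(n)] for _ in range(m)]
    for t in range(num_slices):
        for u in range(num_slices):
            P = int_matmul_nt(Da[t], Db[u])  # one integer GEMM per slice pair
            shift = -BITS * (t + u + 2)
            for i in range(m):
                for j in range(n):
                    # float(P[i][j]) is exact for small k, and ldexp only shifts the exponent
                    terms[i][j].append(math.ldexp(float(P[i][j]), shift + ea[i] + eb[j]))
    # math.fsum rounds the sum of the partial products only once
    return [[math.fsum(terms[i][j]) for j in range(n)] for i in range(m)]


def exact_matmul(A, B):
    """Reference: exact rational product, rounded once to FP64."""
    return [[float(sum(Fraction(a) * Fraction(b) for a, b in zip(row, col)))
             for col in zip(*B)] for row in A]


if __name__ == "__main__":
    random.seed(0)
    A = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
    B = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
    ref = exact_matmul(A, B)
    for s in (2, 4, 8):
        C = ozaki_int8_gemm(A, B, s)
        err = max(abs(c - r) for crow, rrow in zip(C, ref) for c, r in zip(crow, rrow))
        print(f"slices={s}: max abs error vs exact product = {err:.3e}")
```

Computing every slice pair keeps the sketch short, but the number of integer GEMMs grows quadratically with `num_slices`, so the low-order pairs that contribute little to the result are natural candidates to skip. The error against `exact_matmul` should shrink as `num_slices` increases.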

Code & Models

Repositories

enp1s0/ozimmu (official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Quantum Computing Algorithms and Architecture · Advanced Data Storage Technologies