DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

Daichi Mukunoki

arXiv:2508.00441·cs.PF·September 26, 2025

Open Access

TL;DR

This paper explores using FP8 Tensor Cores and integer-based FP64 emulation within the Ozaki scheme to perform FP64-accurate matrix multiplication (DGEMM) on GPUs, so that scientific computations can run on AI-oriented hardware that prioritizes low-precision arithmetic.

Contribution

The paper combines FP8 Tensor Cores with integer-based FP64 emulation within the Ozaki scheme to perform accurate matrix multiplication on modern GPUs.

Findings

1. FP8 Tensor Cores can be effectively used for matrix multiplication with the Ozaki scheme.

2. FP64 emulation based on integer arithmetic eliminates the need for hardware FP64 instructions.

3. Blocking techniques significantly accelerate FP16-based matrix multiplication implementations.

Abstract

As the demand for AI computation rapidly increases, more hardware is being developed to efficiently perform the low-precision matrix multiplications required by such workloads. However, these operations are generally not directly applicable to scientific computations due to accuracy requirements. The Ozaki scheme - an accurate matrix multiplication method proposed by Ozaki et al. in 2012 - enables FP64 matrix multiplication (DGEMM) using low-precision matrix multiplication units, such as FP16 Tensor Cores. This approach has since been extended to utilize integer arithmetic, offering lower computational cost compared to floating-point-based implementations. In fact, it has achieved higher performance than hardware FP64 operations on GPUs equipped with fast INT8 Tensor Cores designed for AI workloads. However, recent trends in AI-oriented processors have shifted toward improving the…
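The slicing idea behind the Ozaki scheme can be illustrated with a minimal pure-Python sketch. It follows the integer-slice variant the abstract mentions, not the FP8 implementation studied in the paper. The function names `split_rows` and `ozaki_matmul`, the 7-bit slice width, and the choice of 8 slices are all assumptions made for illustration. Python integers stand in for the small-integer inputs and wider accumulators a Tensor Core would provide.

```python
import math

def split_rows(M, num_slices, bits):
    """Split each row of M into `num_slices` integer slices of `bits` bits,
    sharing one power-of-two scale per row (hypothetical helper)."""
    exps = []
    slices = [[] for _ in range(num_slices)]
    for row in M:
        amax = max(abs(x) for x in row)
        e = math.frexp(amax)[1]                    # amax < 2**e (e = 0 for a zero row)
        exps.append(e)
        resid = [math.ldexp(x, -e) for x in row]   # exact scaling, |resid| < 1
        for k in range(num_slices):
            scaled = [math.ldexp(r, bits) for r in resid]   # exact, |scaled| < 2**bits
            ints = [int(s) for s in scaled]                  # truncate toward zero
            slices[k].append(ints)
            resid = [s - i for s, i in zip(scaled, ints)]    # exact fractional part
    return exps, slices

def ozaki_matmul(A, B, num_slices=8, bits=7):
    """Approximate A @ B using only small-integer dot products (sketch)."""
    n, m = len(A), len(B[0])
    ea, As = split_rows(A, num_slices, bits)            # per-row scaling of A
    Bt = [list(col) for col in zip(*B)]
    eb, Bs = split_rows(Bt, num_slices, bits)           # per-column scaling of B
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            total = 0                                    # exact integer accumulator
            for s in range(num_slices):
                for t in range(num_slices):
                    # Each slice-pair product is an exact integer dot product.
                    dot = sum(a * b for a, b in zip(As[s][i], Bs[t][j]))
                    total += dot << (bits * (2 * num_slices - s - t - 2))
            # Undo the row/column scales and the slice shifts with one rounding.
            C[i][j] = math.ldexp(float(total), ea[i] + eb[j] - 2 * bits * num_slices)
    return C
```

For example, `ozaki_matmul([[1.5, -2.25]], [[4.0], [-1.0]])` reproduces the product entry 8.25, since every input is captured exactly by the first slice. With 8 slices of 7 bits, each row or column is represented to 56 bits relative to its largest element, so entries much smaller than the row maximum lose low-order bits; a production implementation would choose the slice count from the data and skip slice pairs that contribute below the target accuracy.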

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Superconducting Materials and Applications · Advanced Data Storage Technologies