Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
Angelika Schwarz, Anton Anders, Cole Brower, Harun Bayraktar, John Gunnels, Kate Clark, RuQing G. Xu, Samuel Rodriguez, Sebastien Cayrols, Paweł Tabaszewski, Victor Podlozhnyuk

TL;DR
This paper introduces Automatic Dynamic Precision (ADP), a GPU framework that uses extended Ozaki decompositions to emulate FP64 matrix multiplication on reduced-precision Tensor Cores while preserving FP64-level accuracy and delivering high performance.
Contribution
It presents a fully GPU-resident framework built on a hardware-agnostic estimator, the Exponent Span Capacity (ESC), together with improved decomposition schemes for reliable FP64 emulation on low-precision Tensor Cores.
Findings
Consistently preserves FP64 fidelity on challenging inputs.
Achieves up to 2.3x and 13.2x speedups over native FP64 GEMM.
Incurs less than 10% runtime overhead.
Abstract
The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput than traditional FP64 pipelines. This hardware shift has sparked a new line of algorithm research: using low-precision units to emulate double-precision accuracy through schemes such as Ozaki decompositions. We advance this direction with Automatic Dynamic Precision (ADP), a fully GPU-resident framework that makes emulated FP64 matrix multiplication both efficient and reliable. At its core is the Exponent Span Capacity (ESC), a hardware-agnostic estimator that conservatively determines the decomposition parameter (also known as slices) required to achieve FP64-level accuracy. Built on ESC, ADP integrates exception handling, run time heuristics, and…
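To make the idea of "slices" concrete, here is a minimal sketch of an Ozaki-style sliced dot product in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names (`split_slices`, `emulated_dot`, `exact_dot`), the 7-bit slice width, and the fixed slice count are all hypothetical choices, and exact Python integer arithmetic stands in for a low-precision integer Tensor Core product. The sketch does not implement ESC; it only shows why the number of slices needed depends on the exponent span of the inputs.

```python
import math
from fractions import Fraction

def split_slices(vec, num_slices, bits=7):
    """Split each float into `num_slices` small signed integers ("slices")
    relative to one shared power-of-two scale (hypothetical helper)."""
    # Shared exponent chosen so that every |x| < 2**exp.
    exp = max((math.frexp(x)[1] for x in vec if x != 0.0), default=0)
    slices = []
    for x in vec:
        r = math.ldexp(x, -exp)       # exact scaling, |r| < 1
        digits = []
        for _ in range(num_slices):
            r = math.ldexp(r, bits)   # expose the next `bits` bits
            q = int(r)                # truncate toward zero, |q| < 2**bits
            digits.append(q)
            r -= q                    # exact remainder for the next slice
        slices.append(digits)
    return exp, slices

def emulated_dot(a, b, num_slices, bits=7):
    """Dot product assembled from exact integer slice products."""
    ea, sa = split_slices(a, num_slices, bits)
    eb, sb = split_slices(b, num_slices, bits)
    terms = []
    for s in range(num_slices):
        for t in range(num_slices):
            # Exact integer dot product; stands in for a low-precision
            # integer matrix-multiply-accumulate.
            acc = sum(x[s] * y[t] for x, y in zip(sa, sb))
            # Slice s carries weight 2**(-(s + 1) * bits) relative to the scale.
            terms.append(math.ldexp(float(acc), ea + eb - (s + t + 2) * bits))
    return math.fsum(terms)           # correctly rounded sum of the terms

def exact_dot(a, b):
    """Exact rational reference, used only to measure the error."""
    return sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))

a = [1.0, 1e-3, -2.5]
b = [3.0, 7.0, 0.125]
for n in (2, 4, 8):
    err = abs(Fraction(emulated_dot(a, b, n)) - exact_dot(a, b))
    print(n, float(err))
```

In this sketch the small entry `1e-3` shares a scale with `-2.5`, so its low-order bits are only captured once enough slices are used. The error shrinks as the slice count grows, which is the trade-off that a slice-count estimator has to resolve.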
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms · Cryptography and Residue Arithmetic
